AlwaysOn Professional

Reducing Loss of Connectivity when Recreating an Availability Group


When you drop an availability group, the listener resource is also dropped, which interrupts application connectivity to the availability databases.

To minimize application downtime, use one of the following methods to sustain application connectivity through the listener while you drop and recreate the availability group.

Method 1 Associate listener with new availability group (role) in Failover Cluster Manager

This method enables you to maintain the listener while dropping and recreating the availability group.

1. On the SQL Server instance that the existing availability group listener is directing connections to, create a new, empty availability group. To simplify, use the following Transact-SQL command to create an availability group with no secondary replica or database:


use master
go
create availability group ag
for replica on 'sqlnode1' with
(endpoint_url = 'tcp://sqlnode1:5022', availability_mode=asynchronous_commit, failover_mode=manual) 
 

2. Launch Failover Cluster Manager and click Roles in the left pane. In the pane listing the Roles select the original availability group.

3. In the bottom-middle pane, under the Resources tab right-click the availability group resource and choose Properties. Click the Dependencies tab and delete the dependency to the listener. Click OK.

 

 

4. In the bottom-middle pane, under the Resources tab, right click the listener and choose More Actions and then Assign to Another Role. In the dialog box, choose the new availability group and click OK.

 

 

5. In the Roles pane select the new availability group. In the bottom-middle pane, under the Resources tab, you should now see the new availability group and the listener resource. Right-click the new availability group resource and choose Properties. Click the Dependencies tab and select the listener resource from the drop-down box. Click OK.

 


 
     

6. In SQL Server Management Studio, use Object Explorer to connect to the SQL Server instance hosting the primary replica of the new availability group. Drill into AlwaysOn High Availability, then drill into the new availability group and then Availability Group Listeners. You should find the listener.

7. Right-click the listener and choose Properties. Enter the appropriate port number for the listener and click OK.
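Alternatively, the listener port can be set with Transact-SQL instead of the Properties dialog. A minimal sketch, assuming the new availability group is named 'ag', the listener is named 'aglisten', and port 1433 is required (substitute your own names and port):

-- run on the instance hosting the primary replica of the availability group that owns the listener
alter availability group ag
modify listener 'aglisten' (port = 1433)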

 


  

This will ensure that applications using the listener can still use it to connect to SQL Server hosting the production databases without interruption. The original availability group can now be completely removed and recreated or the databases and replicas can be added to the new availability group.

IMPORTANT: If you recreate the original availability group, use the following steps to assign the listener back to the recreated availability group role, set the dependency between the recreated availability group resource and the listener, and re-assign the port to the listener:

1. Launch Failover Cluster Manager and click Roles in the left pane. In the pane listing the Roles select the new availability group hosting the listener.

2.  In the bottom-middle pane, under the Resources tab right-click the new availability group resource, which currently has a dependency on the listener and choose Properties. Click the Dependencies tab and delete the dependency to the listener. Click OK.

3. In the bottom middle pane, under the Resources tab, right click the listener and choose More Actions and then Assign to Another Role. In the dialog box, choose the recreated availability group and click OK.

4. In the Roles pane select the recreated availability group. In the bottom middle pane, under the Resources tab, you should now see the recreated availability group and the listener resource. Right-click the recreated availability group resource and choose Properties. Click the Dependencies tab and select the listener resource from the drop-down box. Click OK.

5. In SQL Server Management Studio, use Object Explorer to connect to the SQL Server instance hosting the primary replica of the recreated availability group. Drill into AlwaysOn High Availability, then into the recreated availability group, and then Availability Group Listeners. You should find the listener.

6. Right-click the listener and choose Properties. Enter the appropriate port number for the listener and click OK.

Method 2 Associate listener with existing SQL Failover Clustered Instance (SQLFCI)

If you are hosting your availability group on a SQL Server Failover Clustered Instance (SQLFCI) you can associate the listener clustered resource with the SQLFCI clustered resource group, while dropping and recreating the availability group.

1. Launch Failover Cluster Manager and click Roles in the left pane. In the pane listing the Roles select the original availability group.

2. In the bottom middle pane, under the Resources tab, right-click the availability group resource and choose Properties. Click the Dependencies tab and delete the dependency to the listener. Click OK.

3. In the bottom middle pane, under the Resources tab, right click the listener and choose More Actions and then Assign to Another Role. In the dialog box, choose the SQL Server FCI instance and click OK.


  
  
  

4. In the Roles pane select the SQL Server Failover Clustered Instance (SQLFCI) group. In the bottom middle pane, under the Resources tab, you should now see the new listener resource.

This will ensure that applications using the listener can still use it to connect to SQL Server hosting the production databases without interruption. The original availability group can now be completely removed and recreated.

IMPORTANT: Once the availability group is recreated, re-assign the listener back to the availability group role, set up the dependency between the new availability group resource and the listener and re-assign the port to the listener:

1. Launch Failover Cluster Manager and click Roles in the left pane. In the pane listing the Roles select the original SQL Failover Clustered Instance role.

2. In the bottom middle pane, under the Resources tab, right click the listener and choose More Actions and then Assign to Another Role. In the dialog box, choose the recreated availability group and click OK.

3. In the Roles pane select the recreated availability group. Under the Resources tab, you should now see the recreated availability group and the listener resource. Right-click the recreated availability group resource and choose Properties. Click the Dependencies tab and select the listener resource from the drop-down box. Click OK.

4. In SQL Server Management Studio, use Object Explorer to connect to the SQL Server instance hosting the primary replica of the recreated availability group. Drill into AlwaysOn High Availability, then drill into the recreated availability group and then Availability Group Listeners. You should find the listener. 

5. Right-click the listener and choose Properties. Enter the appropriate port number for the listener and click OK.

Method 3 Drop the availability group and recreate with the same listener name

  This method will result in a small outage for currently connected applications, because the availability group and listener are dropped and then recreated.

1. Drop the problematic availability group. This will also drop the listener.

2. Immediately create a new, empty availability group on the same server hosting the production databases, defined with the original listener name. Applications should now successfully re-connect by using the new listener. For example, assume your availability group listener is 'aglisten'. The following Transact-SQL creates an availability group with no database or secondary replica, but creates a listener named 'aglisten' through which applications can resume connecting.

use master
go
create availability group ag
for replica on 'sqlnode1' with (endpoint_url = 'tcp://sqlnode1:5022', availability_mode=asynchronous_commit, failover_mode=manual)
listener 'aglisten' (with ip ((n'11.0.0.25', n'255.0.0.0')), port=1433)
go

3. Recover the damaged database and add it and the secondary replica back to the availability group.
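The add-back in step 3 can be done with Transact-SQL. A rough sketch, using the example group 'ag' from above and hypothetical names 'agdb' for the database and 'sqlnode2' for the secondary (the secondary copy of the database must already be restored WITH NORECOVERY):

-- on the primary (sqlnode1)
use master
go
alter availability group ag add database agdb
go
alter availability group ag
add replica on 'sqlnode2' with
(endpoint_url = 'tcp://sqlnode2:5022', availability_mode=asynchronous_commit, failover_mode=manual)
go

-- on the secondary (sqlnode2)
alter availability group ag join
go
alter database agdb set hadr availability group = ag
go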

 

 

 


How to enable TDE Encryption on a database in an Availability Group


By default, the Add Database Wizard and New Availability Group Wizard for AlwaysOn Availability Groups do not support databases that are already encrypted:  see Encrypted Databases with AlwaysOn Availability Groups (SQL Server).

If you have a database that is already encrypted, it can be added to an existing Availability Group – just not through the wizard.   You’ll need to follow the procedures outlined in Manually Prepare a Secondary Database for an Availability Group.

This article discusses how TDE encryption can be enabled for a database that already belongs to an Availability Group. Once a database is a member of an Availability Group, it can be configured for TDE encryption, but there are some key steps to follow in order to avoid errors.

To follow the procedures outlined in this article you need:

  1. An AlwaysOn Availability Group with at least one Primary and one Secondary replica defined.
  2. At least one database in the Availability Group.
  3. A Database Master Key on all replica servers (primary and secondary servers)
  4. A Server Certificate installed on all replica instances (primary and all secondary replicas).

 

For this configuration, there are two servers:  

SQL1 – the primary replica instance,  and

SQL2 – the secondary replica instance.

Step One:  Verify each replica instance has a Database Master Key (DMK) in Master – if not, create one.

To determine if an instance has a DMK, issue the following query:

USE master
GO
SELECT * FROM sys.symmetric_keys WHERE name = '##MS_DatabaseMasterKey##'

If a record is returned, then a DMK exists and you do not need to create one, but if not, then one will need to be created. To create a DMK, issue the following TSQL on each replica instance that does not have a DMK already:

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Mhl(9Iy^4jn8hYx#e9%ThXWo*9k6o@';

Notes:

  • If you query the sys.symmetric_keys without a filter, you will notice there may also exist a “Service Master Key” named:   ##MS_ServiceMasterKey##.   The Service Master Key is the root of the SQL Server encryption hierarchy. It is generated automatically the first time it is needed to encrypt another key. By default, the Service Master Key is encrypted using the Windows data protection API and using the local machine key. The Service Master Key can only be opened by the Windows service account under which it was created or by a principal with access to both the service account name and its password.  For more information regarding the Service Master Key (SMK), please refer to the following article:  Service Master Key.  We will not need to concern ourselves with the SMK in this article.
  • If the DMK already exists and you do not know the password, that is okay as long as the service account that runs SQL Server has SA permissions and can open the key when it needs it (default behavior).   For more information refer to the reference articles at the end of this post.
  • You do not need to have the exact same database master key on each SQL instance.   In other words, you do not need to back up the DMK from the primary and restore it onto the secondary.   As long as each secondary has a DMK then that instance is prepared for the server certificate(s).
  • If your instances do not have DMKs and you are creating them, you do not need to have the same password on each instance.   The TSQL command, CREATE MASTER KEY, can be used on each instance independently with a separate password.   The same password can be used, but the key itself will still be different due to how our key generation is done.
  • The DMK itself is not used to encrypt databases – it is used simply to encrypt certificates and other keys in order to keep them protected.  Having different DMKs on each instance will not cause any encryption / decryption problems as a result of being different keys.

 

Step Two:  Create a Server Certificate on the primary replica instance.

To have a Database Encryption Key (DEK) that will be used to enable TDE on a given database, it must be protected by a Server Certificate.  To create a Server Certificate issue the following TSQL command on the primary replica instance (SQL1):

USE master
GO
CREATE CERTIFICATE TDE_DB_EncryptionCert
WITH SUBJECT = 'TDE Certificate for the TDE_DB database'

To validate that the certificate was created, you can issue the following query:

SELECT name, pvt_key_encryption_type_desc, thumbprint FROM sys.certificates

which should return a result set similar to:

image
 

The thumbprint will be useful because when a database is encrypted, it will indicate the thumbprint of the certificate used to encrypt the Database Encryption Key.   A single certificate can be used to encrypt more than one Database Encryption Key, but there can also be many certificates on a server, so the thumbprint will identify which server certificate is needed. 
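For example, a query along the following lines (not part of the original steps, just a convenience) matches each database encryption key to the server certificate that protects it by joining on the thumbprint:

-- run in the master database, where server certificates are stored
SELECT db_name(dek.database_id) AS database_name,
       c.name AS certificate_name,
       c.thumbprint
FROM sys.dm_database_encryption_keys dek
JOIN sys.certificates c ON c.thumbprint = dek.encryptor_thumbprint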

 

Step Three:  Back up the Server Certificate on the primary replica instance.

Once the server certificate has been created, it should be backed up using the BACKUP CERTIFICATE TSQL command (on SQL1):

USE master
BACKUP CERTIFICATE TDE_DB_EncryptionCert
TO FILE = 'TDE_DB_EncryptionCert'
WITH PRIVATE KEY (FILE = 'TDE_DB_PrivateFile',
ENCRYPTION BY PASSWORD = 't2OU4M01&iO0748q*m$4qpZi184WV487')

The BACKUP CERTIFICATE command will create two files.   The first file is the server certificate itself.   The second file is a “private key” file, protected by a password.  Both files and the password will be used to restore the certificate onto other instances.

When specifying the filenames for both the server certificate and the private key file, a path can be specified along with the filename.  If a path is not specified with the files, the file location where Microsoft SQL Server will save the two files is the default “data” location for databases defined for the instance.   For example, on the instance used in this example, the default data path for databases is “C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA”.
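For example, to place the backup files in an explicit folder rather than the default data directory, the command could be written as follows (the folder shown is just an illustration; any path the SQL Server service account can write to will do):

BACKUP CERTIFICATE TDE_DB_EncryptionCert
TO FILE = 'C:\CertBackups\TDE_DB_EncryptionCert'
WITH PRIVATE KEY (FILE = 'C:\CertBackups\TDE_DB_PrivateFile',
ENCRYPTION BY PASSWORD = 't2OU4M01&iO0748q*m$4qpZi184WV487')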

 

Note:

If the server certificate has been previously backed up and the password for the private key file is not known, there is no need to panic.   Simply create a new backup by issuing the BACKUP CERTIFICATE command and specify a new password.   The new password will work with the newly created files (the server certificate file and the private key file).

 

Step Four:  Create the Server Certificate on each secondary replica instance using the files created in Step 3.

The previous TSQL command created two files – the server certificate (in this example, “TDE_DB_EncryptionCert”) and the private key file (in this example, “TDE_DB_PrivateFile”), the latter protected by a password.

These two files along with the password should then be used to create the same server certificate on the other secondary replica instances.

After copying the files to SQL2, connect to a query window on SQL2 and issue the following TSQL command:

CREATE CERTIFICATE TDE_DB_EncryptionCert
FROM FILE = '<path_where_copied>\TDE_DB_EncryptionCert'
WITH PRIVATE KEY
(   FILE = '<path_where_copied>\TDE_DB_PrivateFile',
    DECRYPTION BY PASSWORD = 't2OU4M01&iO0748q*m$4qpZi184WV487')

This installs the server certificate on SQL2.  Once the server certificate is installed on all secondary replica instances, then we are ready to proceed with encrypting the database on the primary replica instance (SQL1).

 

Step Five:  Create the Database Encryption Key on the Primary Replica Instance.

On the primary replica instance (SQL1) issue the following TSQL command to create the Database Encryption Key.

USE TDE_DB2
GO
CREATE DATABASE ENCRYPTION KEY
WITH ALGORITHM = AES_256
ENCRYPTION BY SERVER CERTIFICATE TDE_DB_EncryptionCert

The DEK is the actual key that does the encryption and decryption of the database.  When this key is not in use, it is protected by the server certificate (above).  That is why the server certificate must be installed on each of the instances.    Because this is done inside the database itself, it will be replicated to all of the secondary replicas and the TSQL does not need to be executed again on each of the secondary replicas.

At this point the database is NOT YET encrypted – but the thumbprint identifying the server certificate used to create the DEK has been associated with this database.  If you run the following query on the primary or any of the secondary replicas, you will see a similar result as shown below:

SELECT db_name(database_id), encryption_state, 
    encryptor_thumbprint, encryptor_type FROM sys.dm_database_encryption_keys
 

image

 

Notice that TempDB is encrypted and that the same thumbprint (i.e. Server Certificate) was used to protect the DEK for two different databases.   The encryption state of TDE_DB2 is 1, meaning that it is not encrypted yet.

 

Step Six:  Turn on Database Encryption on the Primary Replica Instance (SQL1)

We are now ready to turn on encryption. The database itself has a database encryption key (DEK) that is protected by the Server Certificate. The server certificate has been installed on all replica instances. The server certificate itself is protected by the Database Master Key (DMK), which has been created on all of the replica instances. At this point each of the secondary instances is capable of decrypting (or encrypting) the database, so as soon as we turn on encryption on the primary, the secondary replica copies will begin encrypting too.

To turn on TDE database encryption, issue the following TSQL command on the primary replica instance (SQL1):

ALTER DATABASE TDE_DB2 SET ENCRYPTION ON

To determine the status of the encryption process, again query sys.dm_database_encryption_keys :

SELECT db_name(database_id), encryption_state, 
    encryptor_thumbprint, encryptor_type, percent_complete FROM sys.dm_database_encryption_keys

When encryption_state = 3, the database is encrypted. It shows a state of 2 while encryption is still taking place, and percent_complete shows the progress while the database is still encrypting. Once encryption has completed, percent_complete returns to 0.

image

 

At this point, you should be able to fail over the Availability Group to any secondary replica and be able to access the database without issue.

 

What happens if I turn on encryption on the primary replica but the server certificate is not on the secondary replica instance?

The database will quit synchronizing and possibly report “suspect” on the secondary.   This is because when the SQL engine opens the files and begins to read the file, the pages inside the file are still encrypted.   It does not have the decryption key to decrypt the pages.  The SQL engine will think the pages are corrupted and report the database as suspect.  You can confirm this is the case by looking in the error log on the secondary.   You will see error messages similar to the following:

 

2014-01-28 16:09:51.42 spid39s Error: 33111, Severity: 16, State: 3.

2014-01-28 16:09:51.42 spid39s Cannot find server certificate with thumbprint '0x48CE37CDA7C99E7A13A9B0ED86BB12AED0448209'.

2014-01-28 16:09:51.45 spid39s AlwaysOn Availability Groups data movement for database 'TDE_DB2' has been suspended for the following reason: "system" (Source ID 2; Source string: 'SUSPEND_FROM_REDO'). To resume data movement on the database, you will need to resume the database manually. For information about how to resume an availability database, see SQL Server Books Online.

2014-01-28 16:09:51.56 spid39s Error: 3313, Severity: 21, State: 2.

2014-01-28 16:09:51.56 spid39s During redoing of a logged operation in database 'TDE_DB2', an error occurred at log record ID (31:291:1). Typically, the specific failure is previously logged as an error in the Windows Event Log service. Restore the database from a full backup, or repair the database.

 

The error messages are quite clear that the SQL engine is missing a certificate – and it’s looking for a specific certificate – as identified by the thumbprint.  If there is more than one server certificate on the primary, then the one that needs to be installed on the secondary is the one whose thumbprint matches the thumbprint in the error message.

The way to resolve this situation is to go back to step three above and back up the certificate from SQL1 (whose thumbprint matches), then create the server certificate on SQL2 as outlined in step four. Once the server certificate exists on the secondary replica instance (SQL2), you can issue the following TSQL command on the secondary (SQL2) to resume synchronization:

 

ALTER DATABASE TDE_DB2 SET HADR RESUME
  

References

Setting up Replication on a database that is part of an AlwaysOn Availability Group


This blog post gives detailed steps on setting up transactional replication on a database that is part of an AlwaysOn Availability Group. The TechNet article below lists the same steps with T-SQL statements:

http://technet.microsoft.com/en-us/library/hh710046.aspx

In this blog, I am going to list out the steps and screenshots too, wherever applicable. This blog doesn't include the steps to set up an AlwaysOn AG.

 

What is Supported?

1) SQL Server replication supports the automatic failover of the publisher, the automatic failover of transactional subscribers, and the manual failover of merge subscribers. The failover of a distributor on an availability database is not supported.

2) In an AlwaysOn availability group a secondary database cannot be a publisher. Re-publishing is not supported when replication is combined with AlwaysOn Availability Groups.

 

 

 

 

Environment

AlwaysOn

SRV1: Synchronous Replica (Current Primary)

SRV2: Synchronous Replica

SRV3: Asynchronous Replica

Availability Group: MyAvailabilityGroup

AG database: MyNorthWind

AG Listener: AGListener

 

 

Below is the environment we will be building at the end of this blog:

SRV1: Original Publisher

SRV2: Publisher Replica

SRV3: Publisher Replica

SRV4: Distributor and Subscriber (you can choose a completely new server to be the distributor as well; however, do not place the distributor on any of the possible publishers, because failover of the distributor is not supported).

Overview

The following sections build the environment described above:

- Configure a remote distributor

- Configure the Publisher at the original Publisher

- Configure Remote distribution on possible publishers

- Configure the Secondary Replica Hosts as Replication Publishers

- Redirect the Original Publisher to the AG Listener Name

- Run the Replication Validation Stored Procedure to verify the Configuration

- Create a Subscription

 

1. Configure a remote distributor

The distributor should not be on the current (or any potential) primary replica of the availability group that the publishing database belongs to. This simply means that the distributor, in our case, should not be on SRV1, SRV2 or SRV3, because these servers are part of the AG that contains the publishing database (MyNorthWind).

We can have a dedicated server (which is not part of the AG) acting as a distributor, or we can have the distributor on the subscriber (provided the subscriber is not part of an AG).

 Let's configure distribution on SRV4.

  • Right click on Replication and select "Configure Distribution". The below screen comes up.

 

 

  • We'll select the first option as we want SRV4 as a distributor.

 

  •  Specify the snapshot folder location.

 

 

  • We'll go with the default distribution database folder.

 

 

 

  • In the below screen, we need to specify SRV1, SRV2 and SRV3 as publishers. Click on Add and then Add SQL Server Publisher. Connect to the 3 servers that can act as publishers. Note that SRV4 already exists in the list and you can choose to leave it that way.


 

 

  • This is how it should look with all publishers added.

 

 

  • Enter in the password that the remote publishers will use to connect to the distributor.

 

 

  • Click Next, make sure "Configure Distribution" is selected, and click Next.
  • Click Finish. Distribution is now successfully set up (a scripted T-SQL equivalent is sketched below).
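For reference, the wizard's work can also be scripted. A rough sketch of the distributor-side T-SQL, run on SRV4 (the snapshot folder path and password are placeholders, not values from the screenshots):

-- run on the distributor, SRV4
use master
go
exec sp_adddistributor @distributor = 'SRV4', @password = '<distributor_admin_password>'
go
exec sp_adddistributiondb @database = 'distribution'
go
-- register each server that can act as a publisher (repeat for SRV2 and SRV3)
exec sp_adddistpublisher @publisher = 'SRV1',
     @distribution_db = 'distribution',
     @working_directory = '\\SRV4\repldata'
go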

 

2. Configure the Primary Replica as the original Publisher

Define SRV1 as the original publisher as it is currently the primary replica. You can use any of the AG replicas as the original publisher, as long as it is the current primary replica.

 

  • In SQL Server Management Studio, use Object Explorer to connect to SRV1 and drill into Replication and then Local Publications.

 

  • Right click Local Publications and choose New Publication. Click Next.
  • In the Distributor dialog, choose the option 'Use the following server as the Distributor', click the Add button and add SRV4. Click Next.
  • Enter the same password that was used in Step 7 of "Configure the distributor".

 

 

  • Select the database to be published: MyNorthWind

 

  •  We'll be setting up Transactional Publication.

 

 

  • In the Articles and Filter Table rows dialogs, make your selections.
  • In the Snapshot dialog, for now, choose the 'Create a snapshot immediately..' and click Next.
  • In the Agent Security dialog box, specify the account under which Snapshot Agent and Log Reader Agent will run. You can also use the SQL Server Agent account to run the Snapshot Agent and Log Reader Agent.
  • In the Wizard Actions dialog, select 'Create the publication' and click Next.
  • Give the publication a name and click Finish in the Complete the Wizard dialog.
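If you prefer to script the publication rather than use the wizard, the equivalent T-SQL looks roughly like the following sketch (the article name 'Customers' is hypothetical; the snapshot and log reader agents still need their own configuration):

-- run on the current primary / original publisher, SRV1
use MyNorthWind
go
exec sp_replicationdboption @dbname = 'MyNorthWind', @optname = 'publish', @value = 'true'
go
exec sp_addpublication @publication = 'Publication_AlwaysOn', @status = 'active', @allow_push = 'true'
go
exec sp_addarticle @publication = 'Publication_AlwaysOn', @article = 'Customers', @source_object = 'Customers'
go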

 

 

3) Configure Remote distribution on possible publishers

For the possible publishers and secondary replicas, SRV2 and SRV3, we'll have to configure them to use the remote distributor that we configured on SRV4.

  • Launch SQL Server Management Studio. Using Object Explorer, connect to SRV2 and right click the Replication tab and choose Configure Distribution. Choose 'Use the following server as the Distributor' and click Add. Select SRV4 as the distributor.

 

 

  • In the Administrator Password dialog, specify the same password to connect to the Distributor.

 

  • In the Wizard Actions dialog, accept the default and click Finish.
  • Click Finish, and then follow the same steps on SRV3 to configure SRV4 as its distributor.

 

4) Configure the Secondary Replica Hosts as Replication Publishers

In the event that a secondary replica transitions to the primary role, it must be configured so that it can take over publishing after a failover. All possible publishers will connect to the subscriber using a linked server. To create a linked server to the subscriber, SRV4, run the query below on the possible publishers, SRV2 and SRV3.

EXEC sys.sp_addlinkedserver
    @server = 'SRV4';

 

5) Redirect the Original Publisher to the AG Listener Name

We have already created an AG listener named AGListener. At the distributor (connect to SRV4), in the distribution database, run the stored procedure sp_redirect_publisher to associate the original publisher and the published database with the availability group listener name of the availability group.

 

USE distribution;
GO
EXEC sys.sp_redirect_publisher
    @original_publisher = 'SRV1',
    @publisher_db = 'MyNorthWind',
    @redirected_publisher = 'AGListener';

 

6) Run the Replication Validation Stored Procedure to verify the Configuration

At the distributor (SRV4), in the distribution database, run the stored procedure sp_validate_replica_hosts_as_publishers to verify that all replica hosts are now configured to serve as publishers for the published database.

 

USE distribution;
GO
DECLARE @redirected_publisher sysname;
EXEC sys.sp_validate_replica_hosts_as_publishers
    @original_publisher = 'SRV1',
    @publisher_db = 'MyNorthWind',
    @redirected_publisher = 'AGListener';

 

The stored procedure sp_validate_replica_hosts_as_publishers should be run from a login with sufficient authorization at each availability group replica host to query for information about the availability group. Unlike sp_validate_redirected_publisher, it uses the credentials of the caller and does not use the login retained in msdb.dbo.MSdistpublishers to connect to the availability group replicas.

 

7) Create a subscription

  • Right click on the publication: Publication_AlwaysOn and select New Subscriptions.

 

  •  Select the publication on SRV1.

 

 

  • We'll create a push subscription; however, a pull subscription will work as well.

 

 

  • Select the subscriber instance as SRV4 and a subscriber database

 

 

  • Select the SQL Server Agent credentials to run the Distribution Agent.

 

 

 

  • Select "Initialize at First Synchronisation" on the subscriber SRV4.

 

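The push subscription can also be created with T-SQL on the publisher. A sketch, assuming a subscriber database named 'MyNorthWindSub' (a hypothetical name):

-- run on the publisher, in the published database
use MyNorthWind
go
exec sp_addsubscription @publication = 'Publication_AlwaysOn',
     @subscriber = 'SRV4', @destination_db = 'MyNorthWindSub', @subscription_type = 'Push'
go
exec sp_addpushsubscription_agent @publication = 'Publication_AlwaysOn',
     @subscriber = 'SRV4', @subscriber_db = 'MyNorthWindSub'
go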

 

How to use the Replication Monitor?

After failover to a secondary replica, Replication Monitor is unable to adjust the name of the publishing instance of SQL Server and will continue to display replication information under the name of the original primary instance of SQL Server. After failover, a tracer token cannot be entered by using the Replication Monitor, however a tracer token entered on the new publisher by using Transact-SQL, is visible in Replication Monitor.

At each availability group replica, add the original publisher to Replication Monitor.
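For example, a tracer token can be posted from T-SQL on the new publisher and will then show up in Replication Monitor (a sketch using the example publication name):

-- run on the current publisher, in the published database
use MyNorthWind
go
exec sys.sp_posttracertoken @publication = 'Publication_AlwaysOn'
go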

 

Publisher failover Demonstration

In this section, we'll fail over the Availability Group from the current primary replica and replication publisher, SRV1, to the secondary replica and possible publisher, SRV2. This will not impact the working of replication in any way.

 

  • In Object Explorer, right-click the availability group and select Failover. The Failover Availability Group Wizard comes up.

 

  • Select the secondary replica you want to failover the AG to, in this case, SRV2.

 

 

  • Connect to SRV2 which is the SQL instance acting as the secondary replica.

 

 

  • Click on Finish and the failover to SRV2 should complete successfully.
  • We can also failover to asynchronous secondary replica and possible publisher, SRV3 in the same way.

 

 

  • This can cause data loss because SRV3 is an asynchronous replica.

 

  • Click on Finish.
  • However, after the failover to an asynchronous secondary replica, data movement on the AG database, MyNorthWind, is paused on the two secondary replicas, SRV1 and SRV2.
  • The database state will show "Not Synchronizing" on SRV2 and SRV1.

 

 

  • Right-click the availability database, MyNorthWind, under the AlwaysOn High Availability node and select "Resume Data Movement". Follow the same steps on SRV1.

 

  • We can go with the default selection "Continue executing after error".

 

 

 

  • Resume data movement on SRV1 as well, and the AlwaysOn database MyNorthWind will show as "Synchronizing" instead of "Synchronized", because SRV3 is the primary replica now and it was initially set as an asynchronous replica.
  • After making these changes, Replication will function as usual.

 

 

Prabhjot Kaur

SQL Server AlwaysOn Support Team

Issue: Replica Unexpectedly Dropped in Availability Group


You have deployed AlwaysOn availability groups and observe that one of the replicas unexpectedly disappears from your availability group. In addition to replica loss, there may be issues with the availability group resource and availability.

Why did my replica disappear?

Because the availability group is a clustered resource, its state information is stored and maintained in the Windows Cluster store. SQL Server must be able to communicate with the cluster and access the availability replica's state information using the Windows cluster protocol. SQL Server will drop a replica if it tries to read the replica information from the cluster store (registry) and one of the following conditions occurs:

  • SQL Server queried the cluster store successfully and the replica did not exist

  • SQL Server queried the cluster store successfully and the replica was found, but the data was corrupt

  • SQL Server could not query the cluster store successfully

If one of these conditions occurs, the replica may be removed. This behavior is by design, but should be very uncommon and may signify a problem with Windows Cluster responsiveness and warrants further investigation.

How do I know SQL Server dropped the replica? 

If SQL Server dropped the replica for one of the reasons described here, SQL Server native error 41096 is raised and reported in the SQL Server error log:

2014-01-21 11:53:16.53 spid30s     AlwaysOn: The local replica of availability group 'groupname' is being removed. The instance of SQL Server failed to validate the integrity of the availability group configuration in the Windows Server Failover Clustering (WSFC) store.  This is expected if the availability group has been removed from another instance of SQL Server. This is an informational message only. No user action is required.
2014-01-21 11:53:16.53 spid30s     The state of the local availability replica in availability group 'groupname' has changed from 'SECONDARY_NORMAL' to 'NOT_AVAILABLE'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

How to Respond 

If the replica removed is a secondary, you can add the replica back into the group without re-initializing the databases so long as the transaction log has not been truncated (due to log backup):

1. Add the replica back to the availability group. Connect to SQL Server hosting the primary and execute ALTER AVAILABILITY GROUP ADD REPLICA command:

use master
go
alter availability group ag
add replica on 'sqlnode2'
with(endpoint_url = 'tcp://sqlnode2:5022',
availability_mode=asynchronous_commit, failover_mode=manual)

 2. Join the availability group. Connect to the SQL Server hosting the secondary and execute the ALTER AVAILABILITY GROUP JOIN command:

 alter availability group ag join

 3. Join each database back to the replica. Connect to the SQL Server hosting the secondary and execute the ALTER DATABASE SET HADR command:

 alter database agdb set hadr availability group = ag

If the command 'alter database...set hadr...' fails with error 1478 it is because a log backup has occurred and the truncation point has advanced, creating a break in the log chain:

Msg 1478, Level 16, State 211, Line 2
The mirror database, "a", has insufficient transaction log data to preserve the log backup chain of the principal database.  This may happen if a log backup from the principal database has not been taken or has not been restored on the mirror database. it is because the database log was backed up and the log chain has been broken. To avoid full re-initialization of the database, restore the backed up log to the database at the secondary.

It is still possible to restore the backed-up log file(s) to the secondary database WITH NORECOVERY and then successfully join the database back into the group:
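A minimal sketch of that sequence on the secondary (the backup file path is a placeholder):

-- apply the log backup(s) taken at the primary, leaving the database restoring
restore log agdb from disk = 'C:\backups\agdb_log.trn' with norecovery
go
-- then join the database back into the availability group
alter database agdb set hadr availability group = ag
go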


If the replica is removed at the primary:

  • If the databases were in the primary role at the time the replica was removed, they are in a recovered state and accessible.

  • If the databases had transitioned to a resolving role at the time the replica was removed, then the databases need to be restored with recovery in order to put them back into production:

 restore database agdb with recovery

In this case, the availability group must be dropped and recreated.

 

How to map logins or use contained SQL database user to connect to other replicas and map to availability databases


Assume a working AlwaysOn environment with an availability group. Suppose this availability group is failed over to another replica. When the application re-connects to the (new primary) replica with the same SQL authenticated user, using the instance name or the listener, there may be a login error like the one below.

This error is mentioned below for reference:

Error in SQLCMD / SSMS:

Login failed for user 'TestSQLLogin1'.

 

Error in SQL ERRORLOG:

Error: 18456, Severity: 14, State: 5.

Login failed for user 'TestSQLLogin1'. Reason: Could not find a login matching the name provided. [CLIENT: 192.168.1.11]

 

State 5 in the above message denotes an invalid user ID or that the login does not exist.

If this login is created with the same login ID on the replica where the login failures occurred, connectivity may still fail for this login, and the error may change to the one below.

 

 

Error in SQLCMD / SSMS:

Login failed for user 'TestSQLLogin1'.

Cannot open database requested by the login. The login failed.

 

Error in SQL ERRORLOG:

Error: 18456, Severity: 14, State: 38.

Login failed for user 'TestSQLLogin1'. Reason: Failed to open the explicitly specified database 'ContosoCRM'. [CLIENT: 192.168.1.11]

 

State 38 in the above message denotes that the database specified in the connection string is no longer valid or online. Assuming the database is online, this could be a symptom of an orphaned login.

 

 

What is causing this issue?

 

A login is required on all replicas that can transition to the primary role, especially in the case where an availability group listener is defined, so that when an application attempts to re-connect following a failover, authentication succeeds at that SQL Server instance. In addition, a login is internally identified in SQL Server by a Security Identifier (SID) value. This value should be the same on all replicas.

 

To test, run the query below on all replicas after changing the name to the appropriate login.

SELECT @@SERVERNAME SERVERNAME, name, sid FROM sys.server_principals WHERE name='TestSQLLogin1';

 

 

If the results show that the sid value differs between the replicas/servers, that is the problem: the login should exist, and its sid should be identical, on all SQL Server instances hosting an availability replica.

  

How to create login with same SID value on all replicas?

1. Create a SQL login on current primary replica and give it appropriate permissions in the database.

2. On the primary replica, run the script mentioned in step 2 of the section "Method 3: Create a log in script that has a blank password" in the KB article below. This script creates a stored procedure called sp_help_revlogin.

 

http://support.microsoft.com/kb/918992

How to transfer logins and passwords between instances of SQL Server

 

3. On the primary replica, run this stored procedure for the SQL login (change the parameter value to the appropriate login name created in step 1 above). This scripts out the CREATE LOGIN statement. Note that it also captures the SID.

 

EXEC sp_help_revlogin 'TestSQLLogin1';
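Conceptually, the generated script does something like the following; the SID shown here is only a placeholder for illustration, always use the value scripted from the primary:

-- illustration only: create the login on another replica with the SID captured from the primary
CREATE LOGIN TestSQLLogin1
WITH PASSWORD = 'Password1',
SID = 0x2F5B769F543973419BCEF78DE9FC1A64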

 

 

4. Run this CREATE LOGIN script on all other replicas.

5. To test, failover to each replica, transitioning it to the primary role and then attempt to connect using the login you created.

 

Can this error also occur for a Windows authenticated user?

This error does not occur for a domain user, since the SID of a domain user is same across replicas. Such a user can directly be created using SSMS. Be sure that the domain user has been added to the replica.

 

However, this error can also occur for Windows pass-through authentication. The SID of a Windows account cannot be manually specified using CREATE LOGIN, so pass-through authentication cannot be used for such connections. Instead, create the login using a domain user (its SID will automatically be the same on all systems) or create the login using a SQL user (using the steps mentioned in this article).

 

Is there another option that does not involve having to manually create a login on each replica? 

Contained databases are a good option. This involves a one-time configuration. Enable contained databases and create a database user with the necessary permissions to execute the function at the secondary.

1. Enable contained databases at the server level at the SQL Server hosting the primary and the secondary.

EXEC sp_configure 'show advanced', 1;
RECONFIGURE;
EXEC sp_configure 'contained database authentication', 1;
RECONFIGURE;

2. Enable partial containment on the availability database at the primary:

ALTER DATABASE ContosoCRM SET CONTAINMENT=PARTIAL;

3. Create your SQL database user in your availability database:

USE ContosoCRM;
CREATE USER TestSQLLogin1 WITH PASSWORD ='Password1';

4. Grant that database user the necessary permissions to execute the function:
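For example, if the application only needs to read data when connected to this database, a minimal grant could look like the following (adjust to whatever permissions your application actually requires):

USE ContosoCRM;
ALTER ROLE db_datareader ADD MEMBER TestSQLLogin1;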

5. Test your connection – note that it is key that you specify the catalog (database) in which the created user is defined.

Keywords

AlwaysOn, SQL Server.

 

Author

Vijay Rodrigues

 

SQL Server manages Preferred and Possible Owner Properties for AlwaysOn Availability Group/Role


As a clustered resource, the availability group clustered resource/role has configurable cluster properties, like possible owners and preferred owners. SQL Server actively manages these resource properties. They are set automatically by SQL Server and should not be modified using Failover Cluster Manager.

SQL Server automatically manages an availability group's Possible and Preferred Owner properties

  • The current primary replica is always set as the availability group resource's Possible Owner and also as a Preferred Owner.
  • When two replicas are configured for AUTOMATIC failover mode (limit is two replicas), both replicas (primary and failover partner) are selected as Possible Owners and Preferred Owners.
  • Preferred owners have a priority order; the current primary replica is listed first.

Note:

This applies only to the availability group resource and not to other resources. For example, dependent resources like the listener and listener IP have all availability group failover nodes as possible owners. Another example is SQL Server cluster resources, whose possible owners are not changed, since their dependencies are established during setup.

Demo (change from Manual to Automatic)

Consider an availability group with its failover mode set to Manual. It has the possible owners and preferred owners shown below.

 

 

Now, if we change the Failover Mode from Manual to Automatic, then SQL Server automatically changes the Preferred and Possible owners if required.

 

 

 

 

Demo (failover to a replica)

If we fail over (for example, from SQLONE to SQLTWO), the possible and preferred owners are automatically changed by SQL Server, if required. In the scenario below, the order of preferred owners changes because SQLTWO is now the primary replica.

  

 

In conclusion, there are clustered resources for which you can configure possible/preferred owners. For availability groups, SQL Server dynamically configures these values, and they should not be modified; manual modification can result in unexpected behavior of the resource/group.

 

More Information:

Reference query to get failover mode:
SELECT replica_server_name, failover_mode_desc FROM sys.availability_replicas;
 
replica_server_name failover_mode_desc
--------------------------------------
SQLONE AUTOMATIC
SQLTHREE AUTOMATIC
SQLTWO MANUAL
(3 row(s) affected)

Additional information:
The possible owner change is done using rcm::RcmApi::AddPossibleOwner, rcm::RcmApi::RemovePossibleOwner.

Keyword(s):
AlwaysOn, SQL Server.

Author:
Vijay Rodrigues.

 

 

 

 

 

 

 

 

 

 

 

Manual Failover of Availability Group to Disaster Recovery site in Multi-Site Cluster


Setting up an Always On availability group in a multi-site cluster for disaster recovery (DR) scenario is a common practice.  A common configuration is two nodes in the cluster at the primary data center with a file share or disk witness, and a third node at a remote data center.  When configuring a remote node in your Windows cluster for a disaster recovery site, it is common to remove the quorum vote for that node. A normal vote configuration is for the nodes and witness in the primary datacenter to be configured with quorum votes, and the node at the DR site to not have a vote. 

In the event that the primary data center is lost or expected to be offline for an extended period of time, the availability group must be brought online on the node at the DR site manually.  The following steps are necessary in order to successfully bring the availability group online at the (DR) site.

Force Cluster to start at DR site node

With the failure of the primary data center, quorum will be lost. In order to bring the cluster service up on the remote node, it must be started with forced quorum on the node at the DR site.

Start an elevated Windows PowerShell via Run as Administrator, and execute the following:

Start-ClusterNode -Name "<NodeName>" -FixQuorum

 

Confirm that Cluster has started on the node:

Get-ClusterNode -Name "<NodeName>"

 

Adjusting Voting Rights for the cluster 

With the cluster service started, voting rights for the nodes can be adjusted. If the remote node does not have a vote, then it will need to be configured to have a vote.

PowerShell:

(Get-ClusterNode -Name "NodeName").NodeWeight=1

Get-ClusterNode | fl Name, NodeWeight

 

Once the remote node has been granted a vote for quorum, then remove the votes of the two nodes in the primary datacenter.

PowerShell:

(Get-ClusterNode –name "NodeName1").NodeWeight=0

(Get-ClusterNode –name "NodeName2").NodeWeight=0

Get-ClusterNode | fl Name,NodeWeight.

 

Bring Availability Group Resource Online

Once the cluster service on the remote node has started, the availability group will show offline in the Failover Cluster Manager, and cannot be brought online. 

On the remote DR node, connect to SQL Server and issue the following query in order to bring the availability group online:

ALTER AVAILABILITY GROUP <availability group> FORCE_FAILOVER_ALLOW_DATA_LOSS


At this point, the availability replica is online in the primary role, and the availability databases should be available for production on the DR node.

IMPORTANT: When issuing the failover command, 'FORCE_FAILOVER_ALLOW_DATA_LOSS' must be specified because the cluster service was started with forced quorum, even if the secondary was set up for synchronous commit.

Attempting to fail over with the command 'ALTER AVAILABILITY GROUP <availability group name> FAILOVER' would fail with the following message:

Msg 41142, Level 16, State 34, Line 1

The availability replica for availability group '<availability group>' on this instance of SQL Server cannot become the primary replica. One or more databases are not synchronized or have not joined the availability group, or the WSFC cluster was started in Force Quorum mode. If the cluster was started in Force Quorum mode or the availability replica uses the asynchronous-commit mode, consider performing a forced manual failover (with possible data loss).
Otherwise, once all local secondary databases are joined and synchronized, you can perform a planned manual failover to this secondary replica (without data
loss). For more information, see SQL Server Books Online.

Synchronous Commit – Are My Availability Databases 'Failover Ready'?

Generally, DR site availability replicas are configured for asynchronous commit because of the performance implications of replicating over a long distance. However, if the secondary was set up as synchronous commit, the following query will list the databases and their synchronization status.

select dharcs.replica_server_name, dhdrcs.is_failover_ready, dhdrcs.database_name, dhdrcs.recovery_lsn, dhdrcs.truncation_lsn
from sys.dm_hadr_database_replica_cluster_states dhdrcs join sys.dm_hadr_availability_replica_cluster_states dharcs
on(dhdrcs.replica_id = dharcs.replica_id)
where dharcs.replica_server_name = @@servername

 

The second column is_failover_ready indicates if the database is able to fail over without data loss (in a synchronized state).  If the value of the column is 1, then the database was synchronized when the availability group went offline and can come online without any data loss.  If the value of the column is 0, then the database was not synchronized when the availability group went offline, and there would be data loss if the database was brought online.  If the secondary was setup as asynchronous, then the value of is_failover_ready would always be 0.

To detect whether there was data loss when failing over to the DR site

If this data was not captured from the primary before the failure, there is no way to determine loss until the original primary is recovered. Save the recovery_lsn and truncation_lsn from the query. When the original primary node is recovered, it will be in a suspended state. Query these values on the original primary (now in a secondary role) to determine the data loss between the original primary and the DR secondary (current primary).

NOTE: To keep the cluster from experiencing a split-brain scenario, the nodes in the primary datacenter should only be brought up if the network connection between the two sites is working.

Reference Links:

http://technet.microsoft.com/en-us/library/cc730649(v=WS.10).aspx

http://msdn.microsoft.com/en-us/library/hh270281.aspx

http://msdn.microsoft.com/en-us/library/hh270280.aspx

 

 

 

Create Listener Fails with Message 'The WSFC cluster could not bring the Network Name resource online'


One of the most common configuration issues customers encounter is availability group listener creation. When creating an availability group listener in SQL Server, you might encounter the following messages:

Msg 19471, Level 16, State 0, Line 2
The WSFC cluster could not bring the Network Name resource with DNS name '<DNS name>' online. The DNS name may have been taken or have a conflict with existing name services, or the WSFC cluster service may not be running or may be inaccessible. Use a different DNS name to resolve name conflicts, or check the WSFC cluster log for more information.      

Msg 19476, Level 16, State 4, Line 2
The attempt to create the network name and IP address for the listener failed. The WSFC service may not be running or may be inaccessible in its current state, or the values provided for the network name and IP address may be incorrect. Check the state of the WSFC cluster and validate the network name and IP address with the network administrator.

When using SQL Server Management Studio to add the listener to an existing availability group, these errors may appear in a dialog box:

Diagnosing Listener Creation Failure

The majority of the time, listener creation failures resulting in the messages above are due to a lack of permissions in Active Directory for the Cluster Name Object (CNO) to create and read the listener computer object.

The CNO is the Windows Cluster computer object itself. For example, if you created your Windows Cluster and named it AGCluster, it would appear in Failover Cluster Manager. This is your CNO:

 

In order to confirm that a lack of CNO permissions is responsible for listener creation failure, launch PowerShell with elevated privileges and generate the cluster log (for example, with the Get-ClusterLog cmdlet) on the node hosting the availability group primary replica, where you failed to create the listener:

 

Confirm the problem is CNO permissions

Open the cluster log using Notepad. Here are examples of errors raised when listener creation fails because the CNO lacks permissions. These errors tell us some important information:

 1 The domain controller where those permissions are being checked. In the examples below, it is in the domain 'AGDC'. Therefore, when modifying the permissions in Active Directory, be sure it is done on that DC.

 2 The container that the listener is being created in. In the examples below, it is the Computers container. This is important: you need to know which container in Active Directory to assign the CNO permissions on.

Example
00000be0.00001a4c::2014/03/18-15:32:06.068 INFO  [RES] Network Name <ag_aglisten>: Using domain controller\DC.AGDC.COM.
00000be0.00001a4c::2014/03/18-15:32:06.237 INFO  [RES] Network Name <ag_aglisten>: Failed to find a computer account for aglisten. Attempting to create one on DC \DC.AGDC.COM.
00000be0.00001a4c::2014/03/18-15:32:06.237 INFO  [RES] Network Name <ag_aglisten>: Trying NetUserAdd() to create computer account aglisten on DC \DC.AGDC.COM in default Computers container
00000be0.00001a4c::2014/03/18-15:32:06.295 ERR   [RES] Network Name <ag_aglisten>: Unable to create computer account aglisten on DC \\DC.AGDC.COM, in default Computers container, status 5
000004f8.00002154::2014/03/18-15:32:06.491 INFO  [RCM] rcm::RcmGum::SetDependencies(ag_aglisten)
00000be0.00001a4c::2014/03/18-15:32:06.492 ERR   [RHS] Online for resource ag_aglisten failed.

Example
0000125c.00000f0c::2014/01/14-21:48:31.533 ERR   [RES] Network Name: [NNLIB] Binding to DC \\DC.AGDC.COM (domain AGDC.COM) failed, status 5.
0000125c.00000f0c::2014/01/14-21:48:31.533 INFO  [RES] Network Name <ag_aglisten>: AccountAD: End of Slow Operation, state: Initializing/Writing, prevWorkState: Writing
0000125c.00000f0c::2014/01/14-21:48:31.533 WARN  [RES] Network Name <ag_aglisten>: AccountAD: Slow operation has exception ERROR_ACCESS_DENIED(5)' because of 'status'
0000125c.00000f0c::2014/01/14-21:48:31.533 INFO  [RES] Network Name: Agent: OnInitializeReply, Failure on (2a487cec-66af-4ea6-92d4-95bc70f082a4,AccountAD): 5  

Example
000007c8.00001720::2014/03/27-15:48:21.804 INFO  [RES] Network Name <ag_aglisten>: AccountAD: OU name for VCO is CN=Computers,DC=AGDC,DC=COM
...
000007c8.00001720::2014/03/27-15:48:21.845 INFO  [RES] Network Name: [NNLIB] Using legacy API to create object in default container CN=Computers,DC=AGDC,DC=COM
000007c8.00001720::2014/03/27-15:48:22.165 INFO  [RES] Network Name: [NNLIB] NetUserAdd object aglisten on DC: \\DC.AGDC.COM, result: 0
...
000007c8.00001720::2014/03/27-15:48:22.387 INFO  [RES] Network Name <ag_aglisten>: AccountAD: Giving write permissions for OS and OS version attributes for the core netname
000007c8.00001720::2014/03/27-15:48:22.450 INFO  [RES] Network Name <ag_aglisten>: AccountAD: Setting OS and OS version in AD
00001298.000015f8::2014/03/27-15:48:22.450 INFO  [NM] Received request from client address SQLNODE1.
000007c8.00001720::2014/03/27-15:48:22.498 WARN  [RES] Network Name <ag_aglisten>: AccountAD: Setting OS attributes failed with error 5

Resolve CNO Permissions or Provision the Listener Computer Object

Option 1: Grant the CNO the necessary permissions

Grant the permissions "Read all properties" and "Create Computer objects" to the CNO via the container. Here's an example of granting the required permissions for demonstration purposes:

1. Open the Active Directory Users and Computers Snap-in (dsa.msc).

2. Locate the “Computers” container, or whichever container the listener computer object is being created in. The steps below assume the Computers container.

 

 

3. On the View menu, select "Advanced Features."

 


 

4. Right-click the Computers container and choose Properties.

 

 

 5. Click the Add button and enter the cluster name object (CNO). In this example, it is agcluster$. Click the Object Types button, check 'Computers', click OK, and then OK again.

 

 

6. Back in the Properties dialog, click the Advanced button and the "Advanced Security Settings for SQL" dialog should appear.

 

 

7. Click the Add button. Select "Read all properties" and "Create Computer objects." Click OK until you're back to the Active Directory Users and Computers window.

 

 

8. Attempt to recreate the listener.

 

Option 2: Pre-Stage the VCO

This option is useful in situations where the domain administrator does not allow the CNO “Read All Properties” and “Create computer Objects” permissions:

1. Ensure that you are logged in as a user that has permissions to create computer objects in the domain.

2. Open the Active Directory Users and Computers Snap-in (dsa.msc).

3. On the View menu, select "Advanced Features."

 


 

4. Right click the OU/Container you want the VCO to reside in and click “New” -> “Computer.” In the example below, we are creating the listener object in the Computers container.

 

 

5. Provide a name for the object (this will be your listener name) and click “OK."

 

 

6. Right click the VCO you just created and select “Properties”. Click the Security tab.

 

 

7. Under the Security tab, click the Add button. Enter the cluster name object (CNO). In this example, it is agcluster$. Click the Object Types button, select Computers, and click OK.

 

 

8. Highlight the CNO, check the following permissions, and click “OK” (alternatively, choose Full Control)

Read
Allowed To Authenticate
Change Password
Receive As
Reset Password
Send As
Validate write To DNS Host Name
Validate Write To Service Principal Name
Read Account Restrictions
Write Account Restrictions
Read DNS Host Name Attributes
Read MS-TS-GatewayAccess
Read Personal Information
Read Public Information

 

9. Attempt to create the availability group listener.
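The pre-staging can also be scripted with the ActiveDirectory PowerShell module and dsacls. The following is a minimal sketch, assuming the listener name aglisten, the CNO AGDC\agcluster$, and the Computers container used in this example:

Import-Module ActiveDirectory

# Pre-stage a disabled computer object for the listener in the target container.
New-ADComputer -Name 'aglisten' -Path 'CN=Computers,DC=AGDC,DC=COM' -Enabled $false

# Grant the CNO full control over the pre-staged object (the scripted
# equivalent of choosing Full Control in step 8 above).
dsacls 'CN=aglisten,CN=Computers,DC=AGDC,DC=COM' /G 'AGDC\agcluster$:GA'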

 


Use ReadIntent Routing with Azure AlwaysOn Availability Group Listener


The AlwaysOn availability group listener is supported in Windows Azure virtual machines (VMs), but requires unique configuration steps. For more information on deploying an AlwaysOn availability group listener in Windows Azure virtual machines, see the following link.

Tutorial: Listener Configuration for AlwaysOn Availability Groups in Windows Azure

http://msdn.microsoft.com/en-us/library/windowsazure/dn376546.aspx

Configure Read-Only Routing with a Public Load Balanced Listener or Internal Load Balanced Listener

The tutorial's instructions differ depending on the kind of listener you are creating: a listener that uses public load balancing or a listener that uses internal load balancing (ILB). The following steps provide instructions for configuring read-only routing with a public load balanced listener. Look for notes throughout the instructions that call out the differences when configuring an ILB listener.

 Configure the Azure availability group listener for read-only routing

When the listener is configured using the step-by-step tutorial above, your application can connect to the listener from outside the cloud service hosting the availability group. Your Azure listener can support read-only routing, but additional steps are necessary to configure the listener for read-only routing of your application connections.

The following steps configure your listener to route read-only connection requests to one of the secondary replicas in your Azure availability group. The steps assume an availability group named ag defined on two SQL Server virtual machines, SQLN1 and SQLN2, in a cloud service whose DNS name is mycloudservice.cloudapp.net.

Configure virtual machine public and private ports for read-only routing

IMPORTANT This section is not required if you are configuring read-only routing for an ILB listener. If you are configuring an ILB listener, skip this section and proceed to the section Configure Availability Group Replicas to Accept Read-Only Connections.

In Windows Azure, go to each of the virtual machines that are hosting availability group replicas (in this example, SQLN1 and SQLN2) and create a new Azure virtual machine endpoint. For SQLN1, create an endpoint with a public port of 40001 and a private port of 1433; on SQLN2, create an endpoint with a public port of 40002 and a private port of 1433.

NOTE: This assumes that SQL Server is listening on the default 1433 port. If it is using a non-default port, specify that port for the private port instead.

1 In the virtual machine's Endpoints page, click the Add button at the bottom of the screen.

2 Click 'Add an Endpoint to a Virtual Machine' with 'Add a Stand-Alone Endpoint' selected and click the arrow at the bottom right of the page to advance.

3 Give the new endpoint a name. Define the public port as 40001 and the private port as 1433. When you configure the read-only routing URL for your replicas, you will use the public port value to direct application connectivity to your read-only secondary replica.

In this scenario, the public port can be configured to be anything you wish. Do not select to create a load-balanced set.


Below is what the endpoint will look like on the SQLN1 virtual machine Endpoints page.
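If you prefer to script the endpoint creation, the classic (service management) Azure PowerShell cmdlets can create the same endpoints. The following is a minimal sketch, assuming the cloud service name mycloudservice and a hypothetical endpoint name:

# Create a stand-alone (not load balanced) endpoint on each replica VM,
# mapping the public ports 40001/40002 to the private SQL Server port 1433.
Get-AzureVM -ServiceName 'mycloudservice' -Name 'SQLN1' |
    Add-AzureEndpoint -Name 'SQLReadOnly' -Protocol tcp -PublicPort 40001 -LocalPort 1433 |
    Update-AzureVM

Get-AzureVM -ServiceName 'mycloudservice' -Name 'SQLN2' |
    Add-AzureEndpoint -Name 'SQLReadOnly' -Protocol tcp -PublicPort 40002 -LocalPort 1433 |
    Update-AzureVM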

Configure Availability Group Replicas to Accept Read-Only Connections

Configure your availability group replicas to allow for read-only connection requests when in the secondary role. The following script configures both SQLN1 and SQLN2 replicas, when in the secondary role, to accept read-only connections through the listener.

ALTER AVAILABILITY GROUP [ag]
MODIFY REPLICA ON N'SQLN1' WITH (SECONDARY_ROLE(ALLOW_CONNECTIONS = READ_ONLY))
GO

ALTER AVAILABILITY GROUP [ag]
MODIFY REPLICA ON N'SQLN2' WITH (SECONDARY_ROLE(ALLOW_CONNECTIONS = READ_ONLY))
GO

NOTE: Read-only routing can support ALLOW_CONNECTIONS property set to READ_ONLY or ALL.

Alternatively, use SQL Server Management Studio to pull up the availability group properties using Object Explorer, and set the Readable Secondary property to 'Read-intent only.'

 

Configure Read-Only Routing

Each availability replica that will accept these read-only connections must be defined with a read-only routing URL and a routing list.

1 If you are configuring read-only routing for a public load balanced listener, continue with this step. If you are configuring read-only routing for an internally load balanced listener, go to step 2.

In order to support read-only routing with an Azure listener, configure the READ_ONLY_ROUTING_URL for each availability replica to use the cloud service name and the endpoint public port created in the section above titled Configure virtual machine public and private ports for read-only routing. This URL gives an application the address by which to connect to the read-only replica.

ALTER AVAILABILITY GROUP ag MODIFY REPLICA ON N'SQLN1' WITH (SECONDARY_ROLE (READ_ONLY_ROUTING_URL = N'TCP://mycloudservice.cloudapp.net:40001'));
GO

ALTER AVAILABILITY GROUP ag MODIFY REPLICA ON N'SQLN2' WITH (SECONDARY_ROLE (READ_ONLY_ROUTING_URL = N'TCP://mycloudservice.cloudapp.net:40002'));
GO

2 If you are configuring read-only routing for an internally load balanced listener, continue with this step. Specify the address of the node and the internal port address that SQL Server is listening on.

ALTER AVAILABILITY GROUP ag MODIFY REPLICA ON N'SQLN1' WITH (SECONDARY_ROLE (READ_ONLY_ROUTING_URL = N'TCP://SQLN1:1433'));
GO

ALTER AVAILABILITY GROUP ag MODIFY REPLICA ON N'SQLN2' WITH (SECONDARY_ROLE (READ_ONLY_ROUTING_URL = N'TCP://SQLN2:1433'));
GO

3 Define a routing list for each replica. The routing list designates where read-only connection requests are routed when that replica is in the primary role. For example, when SQLN1 is in the primary role, its routing list consists of SQLN2, which is where read-only connection requests will be routed.

ALTER AVAILABILITY GROUP ag MODIFY REPLICA ON N'SQLN1' WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST=('SQLN2')));
GO

ALTER AVAILABILITY GROUP ag MODIFY REPLICA ON N'SQLN2' WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST=('SQLN1')));
GO

Confirm read-only routing using SQLCMD

To test your newly configured read-only routing, use SQLCMD to specify the application intent option (-K). SQLCMD ships with SQL Server 2012 and supports the latest SQL Server connection parameters for AlwaysOn availability groups, including the new Application Intent connection property.

NOTE: You must also specify one availability database from the availability group using the database option (-d). If this option is not specified your connection will not be successfully routed to the secondary replica.

When attempting to connect to a public load balanced listener, from a command prompt, execute the SQLCMD tool:

C:\>sqlcmd -S mycloudservice.cloudapp.net,59998 -U sa -P xxx -K readonly -d agdb

Where mycloudservice.cloudapp.net is the DNS name of the cloud service, 59998 is the public endpoint defined during listener creation, and agdb is a database defined in the availability group.

 

When attempting to connect to an internal load balanced listener, from a command prompt, execute the SQLCMD tool, specifying the listener name instead of the DNS name of the cloud service. You can successfully connect so long as you are running on a client that is in the same virtual network:

C:\>sqlcmd -S aglisten,59998 -U sa -P xxx -K readonly -d agdb

Create Azure Listener fails with "Unable to save property changes for The parameter is incorrect"


Symptom

When creating a listener for an availability group hosted in Windows Azure virtual machines, the following error may occur.

Unable to save property changes for <cluster resource name> The parameter is incorrect

 

Cause

Creating a listener for an availability group hosted in Azure virtual machines requires special steps, which are documented here.

http://msdn.microsoft.com/en-us/library/dn376546.aspx

As part of creating the Azure listener, the Windows cluster requires a special hotfix on each virtual machine defined as a Windows cluster node. If the hotfix is not installed, the error described above occurs when attempting to execute the PowerShell script documented in step 8 of 'Step 4: Create the availability group listener':

            8. Copy the PowerShell script below into a text editor and set the variables to the values you noted earlier.

 

Resolution

Apply the hotfix as directed on all nodes that are defined in the Windows cluster and that may host the primary replica of the availability group. For more information and to download the fix, see the following article:

 Update enables SQL Server Availability Group Listeners on Windows Server 2008 R2 and Windows Server 2012-based Windows Azure virtual machines

http://support.microsoft.com/kb/2854082

This error occurs because Hotfix KB 2854082 is not installed on the clustered nodes. 
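To check whether the hotfix is already present on a node, Get-HotFix can be used from PowerShell; a minimal sketch:

# Lists the update if KB2854082 is installed on the local node;
# writes an error if it is not found.
Get-HotFix -Id KB2854082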

What the fix does

This update enables support for Windows cluster resources connectivity on Windows Server 2008 R2 and Windows Server 2012-based Windows Azure virtual machines. This support enables the SQL Server Availability Group Listeners.


IaaS with SQL AlwaysOn - Tuning Failover Cluster Network Thresholds


Symptom

When running Windows Failover Clustering in IaaS with SQL Server AlwaysOn, changing the cluster settings to a more relaxed monitoring state is recommended. The out-of-the-box cluster settings are restrictive and can cause unneeded outages; they are designed for highly tuned on-premises networks and do not take into account the possibility of latency induced by a multi-tenant environment such as Windows Azure (IaaS).

Windows Server Failover Clustering constantly monitors the network connections and health of the nodes in a Windows Cluster. If a node is not reachable over the network, recovery action is taken to bring applications and services online on another node in the cluster. Latency in communication between cluster nodes can lead to the following error:

Error 1135 (system event log)

Cluster node 'Node1' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. 

Cluster.log Example

0000ab34.00004e64::2014/06/10-07:54:34.099 DBG   [NETFTAPI] Signaled NetftRemoteUnreachable event, local address 10.xx.x.xxx:3343 remote address 10.x.xx.xx:3343
0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [IM] got event: Remote endpoint 10.xx.xx.xxx:~3343~ unreachable from 10.xx.x.xx:~3343~
0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [IM] Marking Route from 10.xxx.xxx.xxxx:~3343~ to 10.xxx.xx.xxxx:~3343~ as down
0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [NDP] Checking to see if all routes for route (virtual) local fexx::xxx:5dxx:xxxx:3xxx:~0~ to remote xxx::cxxx:xxxd:xxx:dxxx:~0~ are down
0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [NDP] All routes for route (virtual) local fxxx::xxxx:5xxx:xxxx:3xxx:~0~ to remote fexx::xxxx:xxxx:xxxx:xxxx:~0~ are down
0000ab34.00007328::2014/06/10-07:54:34.099 INFO  [CORE] Node 8: executing node 12 failed handlers on a dedicated thread
0000ab34.00007328::2014/06/10-07:54:34.099 INFO  [NODE] Node 8: Cleaning up connections for n12.
0000ab34.00007328::2014/06/10-07:54:34.099 INFO  [Nodename] Clearing 0 unsent and 15 unacknowledged messages.
0000ab34.00007328::2014/06/10-07:54:34.099 INFO  [NODE] Node 8: n12 node object is closing its connections
0000ab34.00008b68::2014/06/10-07:54:34.099 INFO  [DCM] HandleNetftRemoteRouteChange
0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [IM] Route history 1: Old: 05.936, Message: Response, Route sequence: 150415, Received sequence: 150415, Heartbeats counter/threshold: 5/5, Error: Success, NtStatus: 0 Timestamp: 2014/06/10-07:54:28.000, Ticks since last sending: 4
0000ab34.00007328::2014/06/10-07:54:34.099 INFO  [NODE] Node 8: closing n12 node object channels
0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [IM] Route history 2: Old: 06.434, Message: Request, Route sequence: 150414, Received sequence: 150402, Heartbeats counter/threshold: 5/5, Error: Success, NtStatus: 0 Timestamp: 2014/06/10-07:54:27.665, Ticks since last sending: 36
0000ab34.0000a8ac::2014/06/10-07:54:34.099 INFO  [DCM] HandleRequest: dcm/netftRouteChange
0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [IM] Route history 3: Old: 06.934, Message: Response, Route sequence: 150414, Received sequence: 150414, Heartbeats counter/threshold: 5/5, Error: Success, NtStatus: 0 Timestamp: 2014/06/10-07:54:27.165, Ticks since last sending: 4
0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [IM] Route history 4: Old: 07.434, Message: Request, Route sequence: 150413, Received sequence: 150401, Heartbeats counter/threshold: 5/5, Error: Success, NtStatus: 0 Timestamp: 2014/06/10-07:54:26.664, Ticks since last sending: 36
……
……
0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <realLocal>10.xxx.xx.xxx:~3343~</realLocal>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <realRemote>10.xxx.xx.xxx:~3343~</realRemote>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <virtualLocal>fexx::xxxx:xxxx:xxxx:xxxx:~0~</virtualLocal>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <virtualRemote>fexx::xxxx:xxxx:xxxx:xxxx:~0~</virtualRemote>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <Delay>1000</Delay>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <Threshold>5</Threshold>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <Priority>140481</Priority>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <Attributes>2147483649</Attributes>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO  </struct mscs::FaultTolerantRoute>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO   removed
……
……
0000ab34.0000a7c0::2014/06/10-07:54:38.433 ERR   [QUORUM] Node 8: Lost quorum (3 4 5 6 7 8)
0000ab34.0000a7c0::2014/06/10-07:54:38.433 ERR   [QUORUM] Node 8: goingAway: 0, core.IsServiceShutdown: 0
0000ab34.0000a7c0::2014/06/10-07:54:38.433 ERR   lost quorum (status = 5925)

Cause

There are 2 settings that are used to configure the connectivity health of the cluster.

Delay – This defines the frequency at which cluster heartbeats are sent between nodes; it is the interval before the next heartbeat is sent (the SameSubnetDelay and CrossSubnetDelay property values are expressed in milliseconds). Within the same cluster there can be different delays between nodes on the same subnet and between nodes which are on different subnets.

Threshold – This defines the number of heartbeats which are missed before the cluster takes recovery action. The threshold is a number of heartbeats. Within the same cluster there can be different thresholds between nodes on the same subnet and between nodes which are on different subnets.

By default, Windows Server 2012 sets the same-subnet Delay to 1000 ms and the Threshold to 5 heartbeats. For example, if connectivity monitoring fails for five seconds (5 missed heartbeats sent 1000 ms apart), the failover Threshold is reached and the unreachable node is removed from cluster membership. The resources are then moved to another available node in the cluster, and cluster errors are reported, including cluster error 1135 (above).

Resolution

In an IaaS environment, relax the Cluster network configuration settings.

Note: Relaxing the Cluster network configuration settings makes the cluster more tolerant of transient network issues, but genuine failures take longer to detect, which can result in increased downtime when a real outage occurs. For more details and guidance on the modification of Windows Cluster network configuration settings, see the following blog:

Tuning Failover Cluster Network Thresholds

http://blogs.msdn.com/b/clustering/archive/2012/11/21/10370765.aspx 

Steps to verify current configuration

To check the current Cluster network configuration settings, use the Get-Cluster command. Here we can see the default Windows Server 2012 cluster Delay and Threshold settings.

PS C:\Windows\system32> get-cluster | fl *subnet*

Default, minimum and maximum values for each supported OS

Setting                  OS         Min   Max   Default
CrossSubnet Threshold    2008 R2    3     20    5
CrossSubnet Threshold    2012       3     120   5
CrossSubnet Threshold    2012 R2    3     120   5
SameSubnet Threshold     2008 R2    3     10    5
SameSubnet Threshold     2012       3     120   5
SameSubnet Threshold     2012 R2    3     120   5

Recommendations for changing to more relaxed settings

  1 Modify to more relaxed settings

The following settings are recommended for AlwaysOn availability group deployments, whether in the same subnet or across regions; for cross-region (multi-subnet) deployments, the corresponding cross-subnet settings can be relaxed as well (see the sketch after the verification step below). (Note: SameSubnetDelay and CrossSubnetDelay values are in milliseconds):

PS C:\Windows\system32> (get-cluster).SameSubnetDelay = 2000

PS C:\Windows\system32> (get-cluster).SameSubnetThreshold = 15

  2 Verify the changes

PS C:\Windows\system32> get-cluster | fl *subnet* 
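For cross-region (multi-subnet) deployments, the cross-subnet settings can be relaxed in the same way. The values below simply mirror the same-subnet recommendation above and are an assumption rather than a separately documented requirement:

# Relax the cross-subnet heartbeat settings
# (CrossSubnetDelay in milliseconds, CrossSubnetThreshold in heartbeats).
(Get-Cluster).CrossSubnetDelay = 2000
(Get-Cluster).CrossSubnetThreshold = 15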

 

 

Verify the product of your Delay and Threshold is greater than your lease timeout value

By default, SQL Server configures the availability group lease timeout at 20000 ms. The adjustment recommended above ensures that the product of the new Delay and Threshold settings (2000 ms × 15 = 30000 ms) is greater than the default availability group lease timeout (20000 ms).

Confirm the lease timeout property

 1 Launch Failover Cluster Manager. Click on Roles in the left pane.

 2 In the Roles pane, click on the availability group resource.

 3 In the Resource pane, click the Resources tab, right-click the availability group resource, and choose Properties. Click the Properties tab to view the availability group properties, which include LeaseTimeout.
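The same comparison can be scripted; a minimal sketch, assuming the availability group cluster resource is named ag:

# Compare the availability group lease timeout with the cluster heartbeat window.
$lease   = (Get-ClusterResource 'ag' | Get-ClusterParameter -Name LeaseTimeout).Value
$cluster = Get-Cluster
$window  = $cluster.SameSubnetDelay * $cluster.SameSubnetThreshold

"LeaseTimeout: $lease ms; SameSubnetDelay x SameSubnetThreshold: $window ms"
if ($window -le $lease) {
    Write-Warning 'Delay x Threshold should be greater than the lease timeout.'
}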

 

 References

How It Works: SQL Server AlwaysOn LeaseTimeout
http://blogs.msdn.com/b/psssql/archive/2012/09/07/how-it-works-sql-server-alwayson-lease-timeout.aspx

For more information on tuning Windows Cluster network configuration settings see:
Tuning Failover Cluster Network Thresholds     
http://blogs.msdn.com/b/clustering/archive/2012/11/21/10370765.aspx 

For information on using cluster.exe to tune Windows Cluster network configuration settings see:
How to Configure Cluster Networks for a Failover Cluster    
http://technet.microsoft.com/en-us/library/bb690953(v=EXCHG.80).aspx

 

Connection Timeouts in Multi-subnet Availability Group


THE DEFINITION

One of the issues that generates a lot of call volume for the AlwaysOn team is connectivity to the availability group listener in multi-subnet environments.

A “multi-subnet” environment is one in which the OS cluster used as the backbone for AlwaysOn has server nodes located in different subnets. Usually there are only two subnets, however, there can be more.

When the availability group listener (AGL) is configured properly, it will have an IP address for each defined subnet and have an “OR” dependency on each of the IP addresses. By default, when it is brought online it is registered in DNS by the Windows Cluster. The cluster submits all of the IP addresses in the dependency list, and the DNS server will generally register an A record for each IP address. (If non-Microsoft DNS servers are used, the exact implementation can be different.)

When a client operating system (OS) needs to resolve the AGL name to IP by querying the DNS server, the DNS server will return multiple IP addresses – one for each subnet. The listener IP address in the subnet currently hosting the availability group primary replica will be online. The other listener IP address(es) will be offline. Because not all of the IP addresses returned by DNS will be online, client applications can run into problems when attempting to connect to the listener.

THE PROBLEM

By default, the behavior of the SQL client libraries is to try all IP addresses returned by the DNS lookup one after another (serially) until all of the IP addresses have been exhausted and either a connection is made or a connection timeout threshold has been reached. This can be problematic because, depending upon DNS configuration, the “correct” or “online” IP address may not be the first IP address returned. The default timeout for a TCP connection attempt is 21 seconds, and if the first IP address attempted is not online, the client waits 21 seconds before attempting the next IP address. For each subsequent IP address, it again waits up to 21 seconds before moving to the next, until the connection attempt times out or it establishes a connection to an IP address that responds.

The default connection timeout period for .NET client libraries is 15 seconds; therefore, some applications may experience intermittent connection timeouts – or delays in connecting – which can cause application delays or performance issues.
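You can see the multiple records for yourself by querying DNS for the listener name. A minimal sketch, assuming a listener named aglisten and a client running Windows 8 / Windows Server 2012 or later:

# Returns one A record per listener IP address (one per subnet);
# typically only the address in the primary replica's subnet is online.
Resolve-DnsName -Name 'aglisten' -Type A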

THE RESOLUTION

Beginning with the native client libraries for SQL 2012 as well as the .NET 4.5 libraries (earlier .NET libraries with hotfixes – see below in Appendix A), Microsoft added a new connection string parameter that can be added to change the connection behavior. This new parameter, MultiSubnetFailover, should be used and set to “TRUE.”  When set to TRUE, the connection attempt behavior changes. It will no longer attempt all of the IP addresses serially, but in parallel. That is, all of the IP addresses that the availability group listener is dependent on will receive a SYN request at the TCP layer “in parallel” (technically one immediately after the other, but not waiting for acknowledgement – so effectively “in parallel”). This means that whichever IP address is online will be attempted immediately rather than waiting for any timeouts on IP addresses that are not online. The server will respond immediately and establish a connection, while the other IP addresses and their respective connection attempts will eventually timeout – but since the application is already connected it does not matter that those connection attempts timeout and fail.
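As an illustration, the parameter can be exercised directly from PowerShell through the .NET SqlClient provider (requires .NET 4.5 or a patched earlier version, per Appendix A). The listener name, port, and database below are assumptions:

# Opens a connection through the listener; with MultiSubnetFailover=True the
# connection attempts are made to all registered IP addresses in parallel.
$connectionString = 'Server=tcp:aglisten,1433;Database=agdb;' +
                    'Integrated Security=SSPI;MultiSubnetFailover=True;Connect Timeout=15'

$connection = New-Object System.Data.SqlClient.SqlConnection $connectionString
$connection.Open()
$connection.Close()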

THE PROBLEM 2 – THE SEQUEL

The client libraries do not enable this parameter by default (i.e. it is set to “FALSE”). The connection strings must be modified to ensure consistent, successful connections to a multi-subnet listener. Sometimes it isn’t possible to modify the connection strings – and so some applications will still encounter timeout issues when trying to connect. The connection timeouts can be intermittent or very consistent, depending upon the order in which IP addresses are returned.

RESOLUTION 2 – THE SEQUEL

One option to resolve the issue if an application cannot use the MultiSubnetFailover parameter is to change the behavior of how the AGL is registered with DNS. This assumes that dynamic updating of DNS is allowed within the environment. If you are unsure, please check with your DNS administrators to determine if dynamic updating of DNS is allowed.

There are two parameters that affect how the AGL is registered with DNS. By modifying these parameters on the server we can transparently change the experience of the client OS in its name resolution caching.

The first parameter of interest is called RegisterAllProvidersIP. This parameter determines whether the Windows Cluster registers all of the IP addresses the AGL is dependent on, or only the one active IP address. When set to 1 (the default if the AGL is created from SQL Server), all of the IP addresses the AGL is dependent on are registered in DNS. When set to 0, only the one active IP address is registered in DNS (the IP address in the subnet hosting the primary replica). (NOTE: if a Client Access Point is created using Windows Failover Cluster Manager, the RegisterAllProvidersIP parameter is set to 0 by default.)

The second parameter is called HostRecordTTL. This parameter governs how long (in seconds) before cached DNS entries on a client OS are expired, forcing the client OS to re-query the DNS server again to obtain the current IP address. By default, this value is 1200 (20 minutes). This means that after a client OS makes a call to the DNS server to resolve this name to an IP address, the client OS will cache that value for 20 minutes, only querying the DNS server again after that cached record expires. If this value is reduced to 120 or 60 for example, then the client OS cached copy will expire much more quickly.

This is important because during a failover in which the primary replica moves from one subnet to the other, the old IP address that was online is un-registered and the new IP address that is brought online is registered. DNS is updated with the new IP address as soon as the AGL comes online, but client operating systems will not resolve the AGL name to the new IP address until their currently cached entry expires. A client OS that re-queried DNS immediately before the failover could have to wait up to 20 minutes before expiring its cached copy and querying the DNS server again to get the new IP address; until then it continues trying to connect to the OLD IP address. By changing the HostRecordTTL parameter to a value much lower than 1200, the cached value expires more quickly. For example, if set to 60, the client OS will wait at most 60 seconds after a failover before acquiring the new IP address – allowing client operating systems to resolve to the new, correct IP address much sooner.

The drawback to setting the value to a lower number is how often the client OS will query the DNS server. If you have a handful of application servers, then changing the value from 1200 to 60 would probably have no perceptible impact on the DNS server(s). However, if there are thousands of client machines that all must resolve the AGL name to IP, this increases the load on the DNS server(s) and could cause problems.

A balance must be drawn between the lowest possible cache expiration time and the increased DNS server load.

The following PowerShell instructions show how to change the RegisterAllProvidersIP and HostRecordTTL settings. It is important to note that these settings cannot take effect until the AGL is brought offline and then online again, forcing it to re-register with DNS. Remember, the availability group is dependent on the AGL. If the AGL goes offline, so will the availability group and the databases. However, this dependency can be temporarily removed, allowing for the OFFLINE and re-ONLINE of the AGL without taking the availability group offline.

If there are applications or users that are actively using the AGL to connect to a replica (primary or secondary with read-only Routing), then the OFFLINE/ONLINE process will cause service interruption regardless of dependencies. Therefore, if the AGL is in use, perform the following steps during a maintenance window.


STEPS TO CHANGE

Availability group listener resource parameters:  RegisterAllProvidersIP and HostRecordTTL

The following steps show how to turn off RegisterAllProvidersIP and reduce the Client OS DNS cache timeouts (HostRecordTTL) parameters, in the event that you cannot use the MultiSubnetFailover=True parameter on all connection strings.

If you prefer, there are two sample scripts (one TSQL and one PowerShell) in Appendix B at the end of this document that have all of the required commands already configured. These scripts can be executed to perform all of the necessary steps.

NOTES:

  • The “>” symbol at the beginning of each line represents the command prompt and should not be typed.
  • Resource names are listed inside brackets (“<” and “>”) – do not include the brackets when typing the command for execution.
       

Before making any changes, it is necessary to get the PORT values for each listener defined in the AG.  This is because temporarily removing the cluster resource dependencies for the AG resource on the AGL will eliminate the port assignment for the listener.  If the port assigned was not 1433, the port needs to be specified again for the listener.   Capturing the existing port assignments before making any changes will allow the restoration of the proper port assignments at the end of the script.

1.  Capture the existing port assignment for each listener in the AG.

In SQL Server Management Studio, connect to the AG primary node, execute the following TSQL, and keep the results to execute after all of the PowerShell commands have been completed.

      

SELECT '-- (1) Copy/Paste the results of this query '
    + ' into a query window.'
    AS [Generated TSQL script:]
UNION
SELECT '-- (2) After all PowerShell scripts/command'
    + ' have been executed,'
UNION
SELECT '-- (3) Execute the following TSQL commands'
    + ' to restore PORT settings.'
UNION
SELECT 'ALTER AVAILABILITY GROUP '
    + ag.name + ' MODIFY LISTENER '''
    + agl.dns_name + ''' (PORT = '
    + CAST(ISNULL(agl.port,1433) AS VARCHAR(5)) + ');'
    + CHAR(13) + CHAR(10)
    FROM sys.availability_group_listeners agl
         INNER JOIN sys.availability_groups ag
        ON agl.group_id = ag.group_id

The above script should yield one or more TSQL statements that can be copied and pasted into a query window in SSMS later to re-configure the port for each listener.

image

image

2. On any one of the nodes in the cluster, open an administrative PowerShell window.

3. Get the cluster resource name for the availability group resource and the availability group listener resource using the following commands:

  • >Import-Module FailoverClusters
  • >Get-ClusterResource

This will produce an output similar to the following:

image

In this list of resources, we will concern ourselves with three different columns:

  • the Name (left most, with heading surrounded in light green box)
  • the OwnerGroup (third column from left, with heading in orange box)
  • the ResourceType (right most column with heading in yellow box)

The scripts below require the use of the resource name (left most column) for the resource on which we will make changes.

To get the correct resource, first find the name of your availability group in the third column (OwnerGroup) (light orange box in picture below). Once you have found the correct group, then find the resources that we need to change. We will look for two types: “SQL Server Availability Group” and “Network Name”. The resource types will be found in the right most column. In the picture below, the availability group resource is underlined in red, and the network name resource (for the listener) is underlined in yellow.

image

For the subsequent steps, use the following resource names:

  • “TestAG_TestAGList”, to substitute for <AG Listener Resource Name>
  • “TestAG”, to substitute for <AG RESOURCE NAME>

4. Change the parameters with the following commands:

  • >Get-ClusterResource <AG Listener Resource Name> | Set-ClusterParameter -Name HostRecordTTL -Value 120
  • >Get-ClusterResource <AG Listener Resource Name> | Set-ClusterParameter -Name RegisterAllProvidersIP -Value 0

image

As can be seen in the example above, the resource name for the “Network Name” type is used, “TestAG_TestAGList”. After the command is executed, a yellow warning message is shown that indicates the parameter change will not take effect until the resource is taken offline and then brought back online. This can be done during a normal availability group cluster failover, or through PowerShell script (later in this document). Similarly, the second command above can be issued to change the parameter RegisterAllProvidersIP. It, too, will return a yellow warning message identical to the one shown – indicating the parameter change will not take effect until the resource is taken offline and brought back online.

5. Temporarily remove dependency between the availability group resource and the listener name resource.

Because the listener name resource has to be taken offline and back online for the above changes to take effect, and the fact that the availability group resource is dependent on the listener name, simply taking the listener name resource offline will also take the availability group (and its databases) offline in the process. To avoid taking the availability group resource offline, the dependency that the availability group has on the listener name can be temporarily removed and then re-applied. This can either be done by the Windows Failover Cluster Manager utility or through PowerShell commands.

To remove the dependency using Windows Failover Cluster Manager:

  • Select the availability group resource.
  • Right click and select properties.
  • On the Properties dialog, navigate to the dependencies tab
  • Select the resource and click the “Delete” button and then “OK” to close the dialog box.

image

To remove the dependency using PowerShell:

  • >Remove-ClusterResourceDependency -Resource <AG RESOURCE NAME> -Provider <AG Listener Resource Name>

6. Offline and re-online the listener resource to force re-registration with DNS and complete the changes:

  • >Stop-ClusterResource <AG Listener Resource Name>
  • >Start-ClusterResource <AG Listener Resource Name>

To force updating DNS on Windows Server 2012 or 2012 R2:
  • >Get-ClusterResource <AG Listener Resource Name> | Update-ClusterNetworkNameResource

To force updating DNS on Windows Server 2008 or 2008R2:
  • >Cluster.exe RES <AG Listener Resource Name> /registerdns

7. Re-add the dependency of the AG resource on the Listener name resource. (The dependency should exist for proper function of the availability group and access to the databases within the availability group. Failure to re-add the dependency could cause unintended behavior of the availability group and database availability.)

To re-add the dependency using Windows Server Failover Cluster Manager:

  • Select the availability group resource.
  • Right click and select properties.
  • On the Properties dialog, navigate to the dependencies tab
  • Click the drop-down underneath "Resource" and select the listener name resource.
  • Click the “apply” button, then “OK” to close the dialog box.

image

To re-add the dependency using PowerShell execute the following command:

  • >Add-ClusterResourceDependency -Resource <AG RESOURCE NAME> -Provider <AG Listener Resource Name>

8. Verification – view the dependency to make sure it is re-applied, ensure all cluster resources for this AG are online, and review the parameters to make sure they’re set to the new value:

  • >Get-ClusterResourceDependency <AG RESOURCE NAME>
  • >Get-ClusterResource <AG Listener Resource Name> | Get-ClusterParameter HostRecordTTL, RegisterAllProvidersIP
  • >Get-ClusterResource

9. Re-configure Listener PORT settings.

In step 1 above, a TSQL script was executed that generated additional TSQL commands that will restore the original PORT settings for the AG Listeners. Copy and paste the results from the query in step 1 into a query window in SSMS connected to the primary, and execute the TSQL. The TSQL generated from step 1 should look similar to:

image

Do not type the TSQL from the above image; use the TSQL that was generated in step 1 on your machine!

After pasting into a query window, the generated TSQL text should look something similar to the following with ALTER AVAILABILITY GROUP statements inside TRY/CATCH blocks.

image

Upon execution it should return successful completion:

image

Finally, the following query will return the list of listeners and their port settings – for all availability groups on the machine:

/* this script will obtain the ports defined
  * for each availability group listener that
  * exists.*/
SELECT ag.name AS [Availability Group],
    agl.dns_name AS [Listener DNS Name],
    agl.port AS [Port]
    FROM sys.availability_group_listeners agl
        INNER JOIN sys.availability_groups ag
        ON agl.group_id = ag.group_id
    ORDER BY ag.name, agl.dns_name

image

HELPFUL REFERENCES

FINAL NOTES

For older operating systems such as Windows 7 and Windows Server 2008 R2, it is recommended that the hotfixes referenced below be applied to ensure connection timeouts do not occur. Because of an issue with TDX/TDI filter drivers, timeouts can still occur when connecting to a name with multiple IP addresses – even when the correct client libraries are used and the MultiSubnetFailover=True parameter is specified. These drivers are usually installed as part of older security products such as anti-virus and intrusion detection. Ensuring these hotfixes have been applied will help prevent connectivity timeouts. Please note, however, that there is no hotfix for Windows Server 2008, only Windows 7 and Windows Server 2008 R2. If using Windows Server 2008 as a client, please refer to the articles below for more options in resolving timeout issues.

Two additional things to consider with respect to registering the listener name with DNS after making changes to either the HostRecordTTL or RegisterAllProvidersIP parameters --- DNS replication and previous settings.

The DNS server that is contacted by the OS cluster when registering or de-registering hostnames may not be the same DNS server that clients are using to resolve names to IP addresses. If this is the case, then it is possible to have additional delays in the client’s ability to get freshly updated information – simply because the client’s DNS server may not have the updated information. DNS replication topology and configuration settings can cause additional delays before all changes are replicated throughout an enterprise network. If significant delays are experienced either after a failover, or when changing these parameters, the network or DNS administrator should be contacted to investigate the DNS replication topology for the enterprise to determine if the time required to replicate across the entire organization can be reduced.

The other item to consider is that the “previous settings” are most likely already cached on client machines. If the parameter settings were “default” prior to making any changes, then any cached entries on client machines will still have the “old” expiration setting (TTL) – which is 20 minutes. That means that even after changing the RegisterAllProvidersIP and HostRecordTTL settings – and taking the resource offline and back online for them to take effect – previously cached entries are not automatically expired. The client must wait for the current TTL to expire before its cached copy is refreshed. This means that it could still be up to 20 minutes before a client picks up the new settings.

After the changes have been made and cached entries have been expired, the new settings will take effect and any subsequent TTL expirations will take place based upon the new setting (for example after 60 or 120 seconds) rather than the original default value of 20 minutes. This can be expedited on client machines if necessary by issuing an IPCONFIG /FLUSHDNS command from an elevated command prompt. This will cause the client to expire all cached entries and re-query the DNS server to obtain the new settings.

APPENDIX A

Section 5.7.1 Client-Connectivity For AlwaysOn Availability Groups from: SQL Server 2012 Release Notes.

The following table summarizes driver support for AlwaysOn Availability Groups:

Driver | Multi-Subnet Failover | Application Intent | Read-Only Routing | Multi-Subnet Failover: Faster Single Subnet Endpoint Failover | Multi-Subnet Failover: Named Instance Resolution For SQL Clustered Instances
SQL Native Client 11.0 ODBC | Yes | Yes | Yes | Yes | Yes
SQL Native Client 11.0 OLEDB | No | Yes | Yes | No | No
ADO.NET with .NET Framework 4.0 with connectivity patch* | Yes | Yes | Yes | Yes | Yes
ADO.NET with .NET Framework 3.5 SP1 with connectivity patch** | Yes | Yes | Yes | Yes | Yes
Microsoft JDBC driver 4.0 for SQL Server | Yes | Yes | Yes | Yes | Yes

*Download the connectivity patch for ADO .NET with .NET Framework 4.0: http://support.microsoft.com/kb/2600211.

**Download the connectivity patch for ADO.NET with .NET Framework 3.5 SP1: http://support.microsoft.com/kb/2654347.

MultiSubnetFailover Keyword and Associated Features

MultiSubnetFailover is a new connection string keyword used to enable faster failover with AlwaysOn availability groups and AlwaysOn Failover Cluster Instances in SQL Server 2012. The following three sub-features are enabled when MultiSubnetFailover=True is set in the connection string:

  • Faster multi-subnet failover to a multi-subnet listener for an AlwaysOn Availability Group or Failover Cluster Instances.
  • Faster single subnet failover to a single subnet listener for an AlwaysOn Availability Group or Failover Cluster Instances.
    • This feature is used when connecting to a listener that has a single IP in a single subnet. It performs more aggressive TCP connection retries to speed up single subnet failovers.
  • Named instance resolution to a multi-subnet AlwaysOn Failover Cluster Instance.
    • This adds named instance resolution support for AlwaysOn Failover Cluster Instances with multiple subnet endpoints.

MultiSubnetFailover=True Not Supported by .NET Framework 3.5 or OLEDB

Issue: If your Availability Group or Failover Cluster Instance has a listener name (known as the network name or Client Access Point in the WSFC Cluster Manager) depending on multiple IP addresses from different subnets, and you are using either ADO.NET with .NET Framework 3.5 SP1 or SQL Native Client 11.0 OLEDB, potentially 50% of your client-connection requests to the availability group listener will hit a connection timeout.


APPENDIX B - SCRIPTS

This is the script that should be run first to collect the port assignments for the listeners and generate TSQL code to be executed after the PowerShell script to re-configure the port settings to their original values.

/* this script will obtain the ports defined
  * for each availability group listener that
  * exists.  If no port is defined, it will
  * use port 1433.
  * The output will show the TSQL syntax
  * to alter the listeners to apply the
  * same port values later, should they
  * need to be re-configured to the same
  * ports.*/

DECLARE @CRLF CHAR(2) = CHAR(13) + CHAR(10)
DECLARE @EndTryCatch VARCHAR(max) = 'END TRY' + @CRLF +
    'BEGIN CATCH' + @CRLF +
    'IF (@@ERROR <> 19468)' + @CRLF +
    'SELECT ERROR_NUMBER() AS ErrNum, ERROR_MESSAGE() AS ErrMsg' +
    @CRLF + 'END CATCH' + @CRLF

SELECT '-- (1) Copy/Paste the results of this query '
    + ' into a query window.'
    AS [Generated TSQL script:]
UNION
SELECT '-- (2) After all PowerShell scripts/command'
    + ' have been executed,'
UNION
SELECT '-- (3) Execute the following TSQL commands'
     + ' to restore PORT settings.'
UNION
SELECT 'BEGIN TRY' + @CRLF + 'ALTER AVAILABILITY GROUP '
    + ag.name + ' MODIFY LISTENER '''
    + agl.dns_name + ''' (PORT = '
    + CAST(ISNULL(agl.port,1433) AS VARCHAR(5)) + ');'
    + @CRLF + @EndTryCatch + @CRLF
    FROM sys.availability_group_listeners agl
        INNER JOIN sys.availability_groups ag
        ON agl.group_id = ag.group_id

PowerShell script to change the HostRecordTTL and RegisterAllProvidersIP settings.

There are five variables that need to be changed before executing the script. They are located toward the top of the script underneath “CHANGE THESE VARIABLES”. It is recommended you become familiar with the script and its options in a test environment before attempting it in production. The script is written so that it affects ALL availability group listener resources for the specified availability group; if the parameters need to be changed for one listener and the availability group has more than one, most likely they should all be changed.

        

#**************************************************************************
#This script is provided "AS IS" with no warranties, and confers no rights.
#   Use of included script samples are subject to the terms specified at
#   http://www.microsoft.com/info/cpyright.htm
#**************************************************************************

#**************************************************************************
#  VARIABLES
#$strAGName          the name of the availability group
#$TTLValue           the # of seconds for HostRecordTTL timeout value
#$AllIPs             [0 | 1] 0 = only register one IP, 1 = register all IPs
#$RestartListener    [0 | 1] 1 = restart listener / 0 = do not restart
#$RemoveDependencies [0 | 1] 1 = temporarily remove / 0 = leave alone
#**************************************************************************

#**************************************************************************
#
# CHANGE THESE VARIABLES
#Define Variables
$strAGName = "TestAG"         #<<<<<<<<<<<<<<<<<<<<<<<<<
$TTLValue = "120"             #<<<<<<<<<<<<<<<<<<<<<<<<<
$AllIPs = 0                   #<<<<<<<<<<<<<<<<<<<<<<<<<
$RestartListener = 1          #<<<<<<<<<<<<<<<<<<<<<<<<<
$RemoveDependencies = 1       #<<<<<<<<<<<<<<<<<<<<<<<<<
#
#**************************************************************************

#**************************************************************************
#Notes: 
#   1) Test this script in non-production environments first.
#   2) This script will change the parameters for _all_ listeners
#      for the specified availability group 
#   3) This script can optionally restart the listener(s)
#   4) if restarting listeners, it can optionally temporarily
#      remove and restore the dependencies to take the
#      listener(s) offline without taking the availability group
#      itself offline.  If choosing not to temporarily remove
#      and restore dependencies, then when the listener(s) are
#      taken offline, the availability group resource will also
#      go offline - thus making the databases in the AG inaccessible.
#   5) if choosing to remove dependencies, the existing dependencies
#      are collected and restored after restarting the listener(s)
#   6) Windows Server 2012/2012R2 has a PowerShell command to
#      re-register listener(s) with DNS.  Windows Server 2008/2008R2
#      does not.  There is logic to determine and use the CLUSTER.EXE
#      command for Windows Server 2008/2008R2
#**************************************************************************

#no changes required below this point

#Get OS version
$OSMajor = ([System.Environment]::OSVersion.Version).Major
$OSMinor = ([System.Environment]::OSVersion.Version).Minor

#load cluster module
Import-Module FailoverClusters

#get the cluster role (group) object based on the AG name provided above
$objAGGroup = Get-ClusterGroup $strAGName -ErrorAction SilentlyContinue

if ($objAGGroup -eq $null)
    {Write-Host "Error:  Availability Group not found."}
else
    {
    #get the AG resource object in this cluster role (group)
    $objAGRes = $objAGGroup | Get-ClusterResource |
        Where-Object {$_.ResourceType -match "SQL Server Availability Group*"}
    #get the listener(s) object(s) in this cluster role (group)
    $objListener = $objAGGroup | Get-ClusterResource |
          Where-Object {$_.ResourceType -match "Network Name*"}

    #change the parameter settings: HostRecordTTL & RegisterAllProvidersIP
    Write-Host "Making changes to Network Name:"  $list.Name
    $objListener | Set-ClusterParameter -Name HostRecordTTL -Value $TTLValue
    $objListener | Set-ClusterParameter -Name RegisterAllProvidersIP -Value @AllIPs
    $objListener | Get-ClusterParameter -Name HostRecordTTL
    $objListener | Get-ClusterParameter -Name RegisterAllProvidersIP

    if ($RestartListener -eq 1) {
        if($RemoveDependencies -eq 1) {
            #capture the dependency(ies) that the AG resource depends on
            $DepStr = ($objAGRes | Get-ClusterResourceDependency).DependencyExpression
            Write-Host "Removing dependecny for " $objAGRes.Name  " on '" $DepStr "'"
             Set-ClusterResourceDependency -Resource $objAGRes -Dependency $null
        } #if remove dependencies

        #restart the listener resource(es)
        Write-Host "Restarting Network Name resource:" $list.Name
        $objListener | Stop-ClusterResource
        $objListener | Start-ClusterResource

        #force re-registration in DNS
        if ($OSMajor -ge 6 -and $OSMinor -ge 2) {
            #Windows Server 2012 and up
            $objListener | Update-ClusterNetworkNameResource -Verbose
         }
        else {
            #for Windows Server 2008/2008R2
             ForEach($list in $objListener) {
                cluster.exe res $list.name  /registerdns
            }#foreach
         }
        if($RemoveDependencies -eq 1) {
            #restore the dependency(ies) to previous setting
            Write-Host "Reapplying dependencies for " $objAGRes.Name
            Set-ClusterResourceDependency -Resource $objAGRes -Dependency $DepStr
            #show dependency (so it can be compared) / show the settings
            $objAGRes | Get-ClusterResourceDependency
        } #if remove dependencies
        else {
        #if we chose not to remove dependencies we need to restart
        #the availability group resource
        $objAGRes | Start-ClusterResource
        }
    } #if restart
}#else - availability group found

Recommendations and Best Practices When Deploying SQL Server AlwaysOn Availability Groups in Microsoft Azure (IaaS)


 

Introduction

Microsoft Azure virtual machines (VMs) with SQL Server can help lower the cost of a high availability and disaster recovery (HADR) database solution. Most SQL Server HADR solutions are supported in Azure virtual machines, both as Azure-only and as hybrid solutions.

There are important considerations and unique configurations for successfully deploying AlwaysOn availability groups in an IaaS environment. This blog lists the key considerations that should be addressed when deploying availability groups in Windows Azure.

 

Windows and Cluster in IaaS Best Practices

 

Heartbeat Recommendations

 IaaS with SQL AlwaysOn - Tuning Failover Cluster Network Thresholds 

When running Windows Failover Clustering in IaaS with SQL Server AlwaysOn, changing the cluster settings to a more relaxed monitoring state is recommended. The out-of-the-box cluster settings are restrictive and can cause unneeded outages; they are designed for highly tuned on-premises networks and do not take into account the possibility of latency induced by a multi-tenant environment such as Windows Azure (IaaS).

 Note: There is a known issue with Cluster Heartbeat settings being reset to default. To resolve this issue apply the following hotfix:

            For Windows 2012 R2 and 2008 SP2

Changed cluster properties revert to default values on cluster nodes that run Windows Server 2012 R2 or Windows Server 2008 SP2

            For Windows 2012

Changed cluster properties revert to default values on cluster nodes that run Windows Server 2012

 

Quorum Recommendations

WSFC Quorum Modes and Voting Configuration (SQL Server) 

SQL Server AlwaysOn Availability Groups takes advantage of Windows Server Failover Clustering (WSFC) as a platform technology. WSFC uses a quorum-based approach to monitor overall cluster health and maximize node-level fault tolerance. A fundamental understanding of WSFC quorum modes and node voting configuration is very important to designing, operating, and troubleshooting your AlwaysOn high availability and disaster recovery solution.

Quorum vote configuration check in AlwaysOn Availability Group Wizards

In this blog we dig deeper into the guidelines of adjusting the quorum voting in the Windows Server Failover Cluster (WSFC) for the availability groups and explain the reasons behind them with a specific example.

Failover Clustering and AlwaysOn Availability Groups (SQL Server)

AlwaysOn Availability Groups, the high availability and disaster recovery solution introduced in SQL Server 2012, requires Windows Server Failover Clustering (WSFC).

The overall health of a WSFC cluster is determined by a quorum of votes from the nodes in the cluster. If the WSFC cluster goes offline because of an unplanned disaster, or due to a persistent hardware or communications failure, manual administrative intervention is required: a Windows Server or WSFC cluster administrator will need to force a quorum and then bring the surviving cluster nodes back online in a non-fault-tolerant configuration.
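For reference, forcing quorum on a surviving node can be done from PowerShell. This is a minimal sketch with a hypothetical node name; use it only as described in the linked documentation, because it brings the cluster up in a non-fault-tolerant state:

# Start the cluster service on the surviving node with forced quorum
# (equivalent to "net start clussvc /forcequorum").
Start-ClusterNode -Name 'SQLNODE1' -FixQuorum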

 

Static IP for DIP / VIP

Network Isolation Options for Machines in Windows Azure Virtual Networks

Application isolation is an important concern in enterprise environments, as enterprise customers seek to protect various environments from unauthorized or unwanted access. This includes the classic front-end and back-end scenario where machines in a particular back-end network or sub-network may only allow certain clients or other computers to connect to a particular endpoint based on a whitelist of IP addresses. These scenarios can be readily implemented in Microsoft Azure whether client applications access virtual machine application servers from the internet, within the Azure environment, or from on-premises through a VPN connection.

 

Storage Account

Storage Spaces - Designing for Performance

Storage spaces can provide high performance storage in Azure when preparation and configuration is done using best practices. However, there are variables to be aware of that can impact the performance of the disks deployed in a storage space.

For example, performance can become degraded if the number of highly used VHDs for standard tier virtual machines approaches 40. For more information, review Configuring Azure Virtual Machines for Optimal Storage Performance.

SQL Server best practices also should be referenced for guidance on optimizing storage and increasing IOPS performance by implementing multiple disks. For more information see the sections 'Windows Azure virtual machine disks and cache settings' and 'Data disks performance options and considerations' in Performance Guidance for SQL Server in Azure Virtual Machines. Also review the section 'I/O performance considerations' of Performance Best Practices for SQL Server in Azure Virtual Machines.

Also, consider using the script available at Automate the creation of an Azure VM preconfigured for max storage performance to create a Microsoft Azure virtual machine optimized for maximum storage performance.

 

SQL Server in IaaS Best Practices

 

Deploying SQL Server AlwaysOn Availability Groups in Windows Azure

Offerings: The following options range from step-by-step configuration of AlwaysOn availability groups to completely automated deployment.

Use the new SQL Server 2014 AlwaysOn gallery.

Tutorial: AlwaysOn Availability Groups in Azure (PowerShell)

Tutorial: AlwaysOn Availability Groups in Azure (GUI)

 

Best Practices for SQL Server in IaaS

Performance Best Practices for SQL Server in Azure Virtual Machines

Review the Best Practices check list to ensure you implement SQL Server tuning options when running in Microsoft Azure (IaaS) environments.

Performance Guidance for SQL Server in Azure Virtual Machines

Review the Performance Guidance whitepaper for an in-depth analysis and recommendations for optimizing SQL Server performance when running in Microsoft Azure (IaaS) environments.

When deploying AlwaysOn availability groups, it is recommended that SQL Server is deployed following the Performance Guidance for SQL Server in Azure Virtual Machines whitepaper and the Performance Best Practices for SQL Server in Azure Virtual Machines guidelines in order to maximize the performance of AlwaysOn availability groups running in Microsoft Azure (IaaS).

 

Create Azure Availability Group Listener

Tutorial: Listener Configuration for AlwaysOn Availability Groups

AlwaysOn availability group listeners require unique configuration steps. Because Microsoft Azure virtual machines have restrictions on static IP addresses, the availability group listener must be configured to use your cloud service IP address for connectivity to your availability group databases.

Use the link Tutorial: Listener Configuration for AlwaysOn Availability Groups for step by step instructions on configuring the Azure availability group listener.
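For reference, the cluster-side configuration the tutorial walks through centers on pointing the listener's IP address resource at the cloud service IP address and enabling a probe port. A minimal PowerShell sketch is shown below; the resource name, network name, address, and probe port are assumptions and must be replaced with the values from your environment and from the tutorial:

# Minimal sketch - assumed names and values; follow the tutorial for the authoritative steps.
Import-Module FailoverClusters

$ipResourceName = "ag_listener_IP"      # assumed name of the listener IP address resource
$cloudServiceIP = "10.0.1.100"          # assumed cloud service (VIP) address
$probePort      = 59999                 # assumed probe port

Get-ClusterResource $ipResourceName | Set-ClusterParameter -Multiple @{
    "Address"              = $cloudServiceIP
    "ProbePort"            = $probePort
    "SubnetMask"           = "255.255.255.255"
    "Network"              = "Cluster Network 1"   # assumed cluster network name
    "OverrideAddressMatch" = 1
    "EnableDhcp"           = 0
}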

 

Testing your Azure Availability Group Listener

Once properly configured, an Azure availability group listener can be connected to from anywhere on the internet.

IMPORTANT: Testing the Azure listener from a workstation or server within the same cloud service that hosts the SQL Server instance hosting the primary replica is not a valid test and will fail. The Azure listener is designed for connectivity from an application running in a different cloud service or connecting across the internet.

For example, on a workstation running in a different cloud service than the availability group, and on which SQL Server client tools are installed, use the following command line to test the connection to the SQL Server hosting the primary replica of your availability group:

sqlcmd –S "<CloudServiceDNSName>,<EndpointPort>" -d "<DatabaseName>" -Q "select @@servername, db_name()" -l 15

          
The DNS name can be found by viewing the dashboard of your cloud service hosting the AlwaysOn availability group using the Azure portal.

The EndpointPort is the public port you specified in step eight during setup of the Azure listener in Tutorial: Listener Configuration for AlwaysOn Availability Groups.

 

Use ReadIntent Routing with Azure AlwaysOn Availability Group Listener

Use ReadIntent Routing with Azure AlwaysOn Availability Group Listener

The Azure listener can be configured to route read-intent connection requests, but this requires unique configuration steps when deployed in Microsoft Azure. For step-by-step instructions, review Use ReadIntent Routing with Azure AlwaysOn Availability Group Listener to properly configure your Azure listener for read-only routing.

Enhance AlwaysOn Failover Policy to Test SQL Server Responsiveness


The AlwaysOn failover policy for AlwaysOn availability groups monitors the health of the SQL Server process hosting the primary replica. For example, one health check mechanism ensures that SQL Server is responsive: the SQL Server Resource DLL establishes a local ODBC connection, and SQL Server must respond on that session within the availability group's HEALTH_CHECK_TIMEOUT setting, which is 30 seconds by default.

SQL Server responds to the ODBC session using a thread running at ABOVE NORMAL priority. As a result, SQL Server could develop a health issue that impacts the worker threads running at NORMAL priority servicing your application sessions, yet still be able to respond to the SQL Resource DLL health check within the HEALTH_CHECK_TIMEOUT period.

In this scenario AlwaysOn failover policy is unable to detect a health issue with SQL Server even though it is not responding to your application's connection and query requests as expected.
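For reference, the failover policy settings mentioned above can be inspected and adjusted with T-SQL. A minimal sketch, assuming an availability group named 'ag' and a primary replica hosted on SQLNODE1 (adjust to your environment):

# Minimal sketch - assumed server and availability group names.
Import-Module SQLPS -DisableNameChecking

# View the current health check timeout and failure condition level
Invoke-Sqlcmd -ServerInstance "SQLNODE1" -Query @"
SELECT name, health_check_timeout, failure_condition_level
FROM sys.availability_groups;
"@

# Raise the health check timeout to 60 seconds (the value is in milliseconds)
Invoke-Sqlcmd -ServerInstance "SQLNODE1" -Query @"
ALTER AVAILABILITY GROUP ag SET (HEALTH_CHECK_TIMEOUT = 60000);
"@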

How to Check that SQL Server is Responsive to User Sessions

You may add a generic script clustered resource to the availability group, which connects to SQL Server and performs a simple query. As a member resource of the availability group resource, if the script resource fails to connect or execute the query, Windows Cluster will attempt to restart or failover the availability group resource.

Attached to this blog is a zipped file, GenericScript_SQLIsAlive.zip, containing

sqlisalive.vbs <- The generic script, written in VBScript, which implements the Windows Cluster IsAlive entry point

Add_SQLIsAliveScript.ps1 <- The PowerShell script which adds the generic script resource to your availability group resource

readme.txt <- Step by step instructions for implementing the generic script resource and additional instructions on how to test the script.

The generic script resource executes IsAlive every 60 seconds and makes an ODBC connection to the local SQL Server hosting the availability group's primary replica. It tests for successful connection and query execution. It can also connect in the context of a designated availability group database to test database access.

Implement the generic script resource

 I Configure the generic script

SQL Server Instance If deployed as is, the generic script, sqlisalive.vbs, connects to the default instance of SQL Server using a local pipe connection on both the primary and the secondary. If this is not the case in your environment (for example, the primary is hosted on the default instance of SQL Server and the secondary replica is hosted on a named instance), the script must be modified on each server to connect to the local instance, using the connection string below:

sConn="Driver={SQL Server Native Client 11.0};Server=lpc:(local);Database=" & Db & ";Trusted_Connection=Yes"

Connection and Command Timeouts The generic script, sqlisalive.vbs, contains the implementation of the IsAlive function, and the script is executed every 60 seconds. By default the script is configured with the following connection and query timeout settings. Change these settings to thresholds that reflect the delay you intend your application to tolerate. The defaults in the script are:

oConn.ConnectionTimeout = 20

oConn.CommandTimeout = 10

Database Health Check AlwaysOn failover policy is designed to monitor the health of the SQL Server process hosting the primary replica of your availability group and does not detect the health of availability databases. The generic script is configured to connect to the master database. The script can be modified to test for availability database access by changing the Db variable to the desired availability database:

Db="master"

IMPORTANT On failover, the availability group and its resources will need to come online successfully in the new primary replica environment. If you intend to test database health with the script resource, be aware that long database recovery on the new primary may cause the script resource to fail, which would prevent the availability group from coming online on the new primary replica. If long recovery is a possibility on failover, and database health is to be detected, steps must be taken in the script to account for this delay.

NOTE: It is best to force a local connection (lpc:servername or lpc:(local)) to make sure the script is connecting only to the local primary instance. The attached generic script uses a local connection.

 II Configure and Execute the Powershell script to Deploy the Generic script

1 Ensure your availability group has two replicas configured for automatic failover.

2 Copy the generic script file, sqlisalive.vbs to an identical local storage location like 'C:\temp\sqlisalive.vbs' on both servers whose replicas are configured for automatic failover. 

3 On the SQL Server hosting the availability group primary replica, launch Windows PowerShell ISE as Administrator.

4 Use File / Open to open the Add_SQLIsAliveScript.ps1 script.

5 In the Add_SQLIsAliveScript.ps1, change the availability group resource variable '$ag' at the top of the script to your availability group name.

6 For the '$scriptfilepath' variable, set the correct path/file name in Add_SQLIsAliveScript.ps1 to the location of the generic script file on both servers. The sample PowerShell script currently uses C:\temp\sqlisalive.vbs.

7 Execute the modified PowerShell script Add_SQLIsAliveScript.ps1 to add the generic script resource to your availability group resource group (a rough sketch of the key cmdlets such a script might use appears after these steps).

8 Launch Failover Cluster Manager and review the availability group resource group to confirm addition of the generic script resource to the availability group resource group. The generic script should appear and come online in the availability group resource group under the Resources tab.
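The attached Add_SQLIsAliveScript.ps1 is the authoritative script. As a rough illustration only, a script that adds a Generic Script resource to an availability group resource group might use cmdlets like the following (names and paths are assumptions):

# Rough illustration only - the attached Add_SQLIsAliveScript.ps1 is authoritative.
Import-Module FailoverClusters

$ag             = "ag"                       # assumed availability group (resource group) name
$scriptfilepath = "C:\temp\sqlisalive.vbs"   # must exist at this path on every automatic failover node

# Add a Generic Script resource to the availability group's resource group
$res = Add-ClusterResource -Name "sqlisalive" -ResourceType "Generic Script" -Group $ag

# Point the resource at the script file
$res | Set-ClusterParameter -Name ScriptFilepath -Value $scriptfilepath

# Bring the new resource online
Start-ClusterResource -Name "sqlisalive"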

 

 

NOTES

 The attached readme.txt file has instructions on how to test the script resource to ensure that it can failover your availability group resource.

 

Diagnose failure detection by generic script resource

Generate the cluster log for the node hosting the primary replica and search for 'SQLCommandFailed' or 'SQLConnectionFailed' - you should find the error and description. For example, on a query timeout, the following is found in the Cluster log:

00000b58.00000600::2014/10/09-17:56:05.307 INFO  [RES] Generic Script <sqlisalive>: Entering IsAlive

00000b58.00000600::2014/10/09-17:56:05.307 INFO  [RES] Generic Script <sqlisalive>: IsAlive SQLCommandFailed

00000b58.00000600::2014/10/09-17:56:05.307 INFO  [RES] Generic Script <sqlisalive>: Error: -2147217871; Native Error: 0; SQL State: S1T00; Source: Microsoft OLE DB Provider for ODBC Drivers

00000b58.00000600::2014/10/09-17:56:05.307 INFO  [RES] Generic Script <sqlisalive>: [Microsoft][SQL Server Native Client 11.0]Query timeout expired

00000b58.00000600::2014/10/09-17:56:05.307 ERR   [RES] Generic Script <sqlisalive>: 'IsAlive' script entry point returned FALSE.'

 

Or for example, there was a problem detected connecting to SQL Server:

00001f4c.0000149c::2014/10/09-18:03:45.083 INFO  [RES] Generic Script <sqlisalive>: Entering IsAlive

00001f4c.0000149c::2014/10/09-18:03:45.133 INFO  [RES] Generic Script <sqlisalive>: IsAlive SQLConnectionFailed

00001f4c.0000149c::2014/10/09-18:03:45.133 INFO  [RES] Generic Script <sqlisalive>: Error: -2147217843; Native Error: 18456; SQL State: 28000; Source: Microsoft OLE DB Provider for ODBC Drivers

00001f4c.0000149c::2014/10/09-18:03:45.133 INFO  [RES] Generic Script <sqlisalive>: [Microsoft][SQL Server Native Client 11.0][SQL Server]Login failed for user ''.

00001f4c.0000149c::2014/10/09-18:03:45.133 ERR   [RES] Generic Script <sqlisalive>: 'IsAlive' script entry point returned FALSE.'
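A quick way to produce and scan the cluster log from PowerShell, assuming the primary replica is hosted on SQLNODE1 and C:\temp is used for output, is sketched below:

# Minimal sketch - assumed node name and output folder.
Import-Module FailoverClusters

# Generate the cluster log for the node hosting the primary replica
# (the log file is typically written as <NodeName>_cluster.log in the destination folder)
Get-ClusterLog -Node "SQLNODE1" -Destination "C:\temp"

# Search the generated log for the IsAlive failure markers
Select-String -Path "C:\temp\SQLNODE1_cluster.log" -Pattern "SQLCommandFailed", "SQLConnectionFailed"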

 

 

 

Determine Availability Group Synchronization State, Minimize Data Loss When Quorum is Forced


When Windows Cluster quorum is lost, whether due to a short-term network issue or a disaster that causes long-term downtime for the server that hosted your primary replica, and forcing quorum is required in order to quickly bring your availability group resource online, a number of circumstances should be considered to eliminate or reduce data loss.

The following scenarios are discussed in order beginning with best-case scenario for minimizing data loss, in the event that quorum must be forced, given these different circumstances:

  1. The availability of the original primary replica
  2. The replica availability mode (synchronous or asynchronous)
  3. The synchronization state of the secondary replicas (SYNCHRONIZED, SYNCHRONIZING, NOT_SYNCHRONIZED)
  4. The number of secondary replicas available for failover

IMPORTANT When you must force quorum to restore your Windows cluster and availability groups to productivity, your SQL Server availability groups will NOT transition back to their original PRIMARY and SECONDARY roles. After you have restarted the Cluster service using forced quorum on any recoverable replicas, availability groups transition to the RESOLVING state and require an additional failover command to bring them back into the PRIMARY and SECONDARY roles. Execute the following command on the SQL Server instance where you intend to bring the availability group back online in the primary role:

ALTER AVAILABILITY GROUP agname FORCE_FAILOVER_ALLOW_DATA_LOSS

 This is by design - forcing quorum is an administrative override and no assumption will be made about where the primary replica should come online.

Scenario - Quorum was lost, my primary replica is still available - failover to original primary

If at all possible, recover the server hosting the original primary replica, if it is available. This guarantees no data loss. If quorum is lost due to a network or some other issue that leaves the server that was hosting the original primary replica intact, force quorum (if the Cluster does not or cannot recover on its own) on the server hosting the original primary replica.

At this stage Windows Cluster will be running, but your availability group will still be reported as RESOLVING and no availability databases will be accessible. Despite being the original primary, you still must failover the availability group, and the only syntax supported after forcing of quorum is: 

ALTER AVAILABILITY GROUP agname FORCE_FAILOVER_ALLOW_DATA_LOSS

This step is necessary for the availability group to transition to the primary role. Rest assured, despite the fact that you are issuing the command '...FORCE_FAILOVER_ALLOW_DATA_LOSS', no data loss is incurred so long as this replica was the original primary.

Scenario - I have lost my primary - failover to a synchronous secondary that is SYNCHRONIZED

When a disaster occurs which results in the long term or permanent loss of your primary replica, to avoid data loss, fail over to a secondary replica 1) whose availability mode is synchronous and 2) is in the SYNCHRONIZED state. So long as the replica was in a SYNCHRONIZED state or your key databases were in a SYNCHRONIZED state at the time that quorum was lost, you can failover to this replica and be assured of no data loss.

In order to determine if your synchronous secondary is in a SYNCHRONIZED state

  1. Force quorum to start the Cluster service on a server that hosted a secondary replica configured for synchronous commit.

  2. Connect to SQL Server and query sys.dm_hadr_database_replica_cluster_states.is_failover_ready. The DMV returns a row for each database in each availability group replica. Here is a sample query that returns each availability database name and its failover-ready state at each replica, that is, whether those databases are in a SYNCHRONIZED state. For example, you force quorum on and connect to SQL Server on SQLNODE2, which is a synchronous-commit secondary replica. The query reports that several key databases are not failover ready on SQLNODE2, but the same query results also report that those databases are synchronized on SQLNODE3.

select arc.replica_server_name, drc.is_failover_ready, drc.database_name
from sys.dm_hadr_database_replica_cluster_states drc
join sys.dm_hadr_availability_replica_cluster_states arc
on(drc.replica_id = arc.replica_id) --where arc.replica_server_name = @@servername

IMPORTANT: Do this step before you fail over your availability group.

 

 

This may lead you to shut down Cluster on SQLNODE2 and force quorum on SQLNODE3, where you can issue the ALTER AVAILABILITY GROUP...FORCE_FAILOVER_ALLOW_DATA_LOSS command to bring the availability databases into the primary role.

Scenario - I have lost my primary - minimize data loss on failover to SYNCHRONIZING or NOT_SYNCHRONIZED secondary replica

If quorum has been lost and the primary is not available, and you must consider failing over to a synchronous-commit replica that is not failover ready (sys.dm_hadr_database_replica_cluster_states.is_failover_ready=0) or to an asynchronous-commit replica, there are further steps you can take to minimize data loss when you have more than one secondary replica to choose from.

Choose secondary to failover by comparing database hardened LSN in SQL Server Error Log (SQL Server 2014 only) 

In SQL Server 2014, the last hardened LSN information is reported in the SQL Server error log when an availability replica transitions to the RESOLVING role. Open the SQL Server error log and locate where each key availability database transitions from SECONDARY to RESOLVING. Immediately following is additional logged information - the key comparator is the 'Hardened Lsn.'

For example, here we check the following SQL Error Log entries from our two secondaries - SQLNODE2 and SQLNODE3 as they transition to RESOLVING:

SQLNODE2

2014-11-25 12:39:32.43 spid68s     The availability group database "agdb" is changing roles from "SECONDARY" to "RESOLVING" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
2014-11-25 12:39:32.43 spid68s     State information for database 'agdb' -
Hardended Lsn: '(77:5368:1)'    Commit LSN: '(76:17176:3)'    Commit Time: 'Nov 25 2014 12:39PM'

Let's also review the same information in the SQL Server error log on secondary SQLNODE3:

SQLNODE3

2014-11-25 12:39:32.52 spid51s     The availability group database "agdb" is changing roles from "SECONDARY" to "RESOLVING" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
2014-11-25 12:39:32.52 spid51s     State information for database 'agdb' -
Hardended Lsn: '(72:1600:1)'    Commit LSN: '(72:1192:3)'    Commit Time: 'Nov 25 2014 12:39PM'

SQLNODE2 has a more advanced hardened LSN. Failing over to SQLNODE2 will minimize data loss.

Choose secondary to failover by failing over each secondary then querying sys.dm_hadr_database_replica_states.last_hardened_lsn (SQL Server 2012)

WARNING The following steps suggest forcing quorum on separate Cluster nodes to determine the progress of log replication to each availability group replica. Prior to performing these steps, configure the Cluster service on each available node for Manual startup; otherwise, you run the risk of creating more than one quorum node set, that is, a split-brain scenario. Whenever forcing quorum, ensure that no other node is currently running with a forced quorum.
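A minimal sketch of these precautions, assuming secondary replicas on SQLNODE2 and SQLNODE3, is shown below; run each command on the node indicated in the comments and never run with forced quorum on more than one node at a time:

# Minimal sketch - assumed node names.
Import-Module FailoverClusters

# On each node hosting a replica: prevent the Cluster service from starting automatically
Set-Service -Name ClusSvc -StartupType Manual

# On SQLNODE2: force quorum there, then inspect last_hardened_lsn as described in the steps below
Start-ClusterNode -Name "SQLNODE2" -FixQuorum

# On SQLNODE2: stop the cluster again before moving on to the next node
Stop-ClusterNode -Name "SQLNODE2"

# On SQLNODE3: force quorum there and repeat the inspection
Start-ClusterNode -Name "SQLNODE3" -FixQuorum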

Since SQL Server 2012 does not report the last_hardened_lsn in the SQL Server error log when transitioning to RESOLVING, you can examine the last hardened LSN on each secondary by forcing quorum and failing over the availability group on one secondary and then the other secondary, each time querying sys.dm_hadr_database_replica_states.last_hardened_lsn. Consider the example, with SQLNODE2 and SQLNODE3 secondary replicas:

 1 Force quorum on SQLNODE2.

 2 Connect to the local instance of SQL Server on SQLNODE2 and force failover of the availability group:

ALTER AVAILABILITY GROUP agname FORCE_FAILOVER_ALLOW_DATA_LOSS

 3 Query for the last_hardened_lsn of your availability databases:

select drs.last_hardened_lsn, arc.replica_server_name, drc.is_failover_ready,
drc.database_name from
sys.dm_hadr_database_replica_cluster_states drc join
sys.dm_hadr_availability_replica_cluster_states arc
on(drc.replica_id = arc.replica_id)
join sys.dm_hadr_database_replica_states drs
on drs.replica_id=arc.replica_id


 4 Stop the Cluster service on SQLNODE2.

 5 Force quorum on SQLNODE3.

 6 Connect to the local instance of SQL Server on SQLNODE3 and force failover of the availability group:

ALTER AVAILABILITY GROUP agname FORCE_FAILOVER_ALLOW_DATA_LOSS

7 Use the same query in step 3 above, to query last_hardened_lsn for your availability databases and compare those values to the results querying SQLNODE2.

If SQLNODE3 returned a more advanced last_hardened_lsn, then continue to move towards resuming production with SQLNODE3 hosting the primary replica.

If SQLNODE2 returned a more advanced last_hardened_lsn, stop the Cluster service on SQLNODE3 and then force quorum on SQLNODE2. Fail over the availability group to bring the primary replica online on SQLNODE2.

When connectivity is restored between the nodes of the Cluster, the remaining replicas will transition back to the SECONDARY role and their availability databases will be in a SUSPENDED state. 

Finally, an alternative is to drop the availability group on each secondary and restore each availability database WITH RECOVERY. This recovers the database so that you can individually inspect the databases at each node and determine which has the most changes; this may require knowledge of key tables whose timestamp columns can be queried. Then, you can recreate the availability group with a single primary replica and add your listener.
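A minimal sketch of that alternative, assuming an availability group named 'agname' and a database named 'agdb' on secondary SQLNODE3, might look like this:

# Minimal sketch - assumed availability group, database, and server names.
Import-Module SQLPS -DisableNameChecking

Invoke-Sqlcmd -ServerInstance "SQLNODE3" -Query @"
-- Remove the availability group metadata from this secondary
DROP AVAILABILITY GROUP agname;

-- Recover the former availability database so it can be inspected locally
RESTORE DATABASE agdb WITH RECOVERY;
"@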

 


Troubleshooting the Add Azure Replica Wizard in SQL Server 2014


SQL Server 2014 introduced a new feature that automates the creation of a new AlwaysOn replica hosted in Microsoft Azure. This replica is added to an existing availability group in your environment and connected to your existing AlwaysOn deployment via VPN.

 

 

Index:

    • Introduction:
    • Pre-requisites:
    • Known issues:
      • Add Azure Replica Wizard – Virtual Network drop down is empty
      • Provisioning new virtual machine fails with error ‘The geo-location constraint…is invalid’ from Add Azure Replica:
      • Validation Error ‘Checking if the cluster name resource is online’ Fails
      • The Add Replica to Availability Group wizard fail during group validation. It fails on:    “Checking if the cluster name resource is online”
      • SQL Add Azure Replica Wizard Fails on Step ‘Configuring Endpoints.’ With error 53 (The network path was not found) and Step ‘Provisioning Windows Azure VM with error 1722 (The RPC server is unavailable)
      • SQL Add Azure Replica Wizard fails with ‘The hosted service name is invalid’
      • SQL Add Azure Replica Wizard Fails on Step ‘Joining secondary replicas to availability group’
      • VPN cannot be behind a NAT (Network Address Translation) device.
    • Troubleshooting Tips

Introduction:

As part of this new feature, the Add Azure Replica wizard will create a new Virtual Machine in Microsoft Azure using an image that already has SQL Server 2014 installed. During the creation (Provisioning phase) the new Virtual Machine will join your current on premises domain. In addition after the provisioning phase completes it will install and configure the Windows Failover Clustering feature, and configure SQL Server to be a new secondary replica to your existing on premises availability group.


There is a high level of automation in the tasks performed by the SQL Server Add Azure Replica Wizard. As a result, some unexpected failures can occur, and this blog covers some of the known common issues. In addition to covering the common failures, we present some basic troubleshooting steps to help resolve any issue that is not covered in this blog.

Note: This Blog does not provide a Step by Step guide to using the Add Azure Replica Wizard. The link below is a tutorial that can walk you through the steps inside the new SQL Server 2014 Add Azure Replica Wizard.


Tutorial: Add Azure Replica Wizard

 

Pre-requisites:

Listed below are some of the prerequisites that are required for a successful replica deployment using the Add Azure Replica Wizard. For a more comprehensive list go to the following:


Use the Add Replica to Availability Group Wizard (SQL Server Management Studio)

 

  1. The Add Azure Replica Wizard must be executed from the host that is the current primary replica.
  2. Note your current OS version and SQL Server version. During the wizard you will need to ensure you select the correct Virtual Machine image. If you are running Windows Server 2012, you need to select the same OS version, since Windows Failover Clustering does not allow mixed OS versions. (For example, Windows Server 2012 and Windows Server 2012 R2 cannot be part of the same cluster or availability group.)
  3. If a server instance that you select to host an availability replica is running under a domain user account and does not yet have a database mirroring endpoint, the wizard can create the endpoint and grant CONNECT permission to the server instance service account. However, if the SQL Server service is running as a built-in account, such as Local System, Local Service, or Network Service, or a local account, you must use certificates for endpoint authentication, and the wizard will be unable to create a database mirroring endpoint on the server instance. In this case, we recommend that you create the database mirroring endpoints manually before you launch the Add Replica to Availability Group Wizard (a sketch of manual endpoint creation appears after this list).
  4. The Windows user account specified in the Add Azure Replica Wizard must have privileges to Add Workstations to the Domain (Create Computer Objects) as well as local administrator privileges on each node of the current Windows Failover Cluster that is hosting the AlwaysOn availability group.
  5. You will need to specify a network share in order for the wizard to create and access backups.
  6. For the primary replica, the account used to start the Database Engine must have read and write file-system permissions on a network share. For secondary replicas, the account must have read permission on the network share.
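For the case described in prerequisite 3, a minimal sketch of manually creating a certificate-authenticated database mirroring endpoint on one instance is shown below. All names and the password are assumptions, and the certificate must also be backed up and installed (with the matching logins and GRANT CONNECT) on the other replicas, which is not shown here:

# Minimal sketch - assumed names and password; repeat the certificate exchange on every replica.
Import-Module SQLPS -DisableNameChecking

Invoke-Sqlcmd -ServerInstance "SQLNODE1" -Query @"
USE master;
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword1!>';
CREATE CERTIFICATE Hadr_cert WITH SUBJECT = 'HADR endpoint certificate';
CREATE ENDPOINT Hadr_endpoint
    STATE = STARTED
    AS TCP (LISTENER_PORT = 5022)
    FOR DATABASE_MIRRORING (AUTHENTICATION = CERTIFICATE Hadr_cert, ROLE = ALL);
"@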

 

 

Known issues:

Issue:

Add Azure Replica Wizard – Virtual Network drop down is empty

Cause:

After the SQL 2014 Add Azure Replica Wizard was shipped a significant architecture change was made in the Windows Azure environment, specifically with Network functionality.

Prior to the hotfix mentioned in the Fix section below, the SQL Add Azure Replica Wizard depended on Affinity Groups. Affinity Groups are no longer used by the Microsoft Azure environment, resulting in the error you see below:

[screenshot]

Previously, when creating a virtual network (VNet) you were required to associate the VNet with an affinity group, which was in turn, associated with a Region. This requirement has changed. Now VNets are associated directly with a Region (Location) in the Management Portal. This allows you more freedom when creating your VNets.

For more information on this change click the link below:
About Regional VNets and Affinity Groups for Virtual Network

Fix:

To resolve this issue, apply CU5 or a later CU / Service Pack for SQL Server 2014 prior to running the SQL Server Add Azure Replica Wizard. As an alternative, you can download a patch that was created prior to the release of CU5. For more information, click the link below:
FIX: Add Azure Replica wizard cannot enumerate Azure Virtual Network in SQL Server 2014

 

Issue:

Provisioning new virtual machine fails with error ‘The geo-location constraint…is invalid’ from Add Azure Replica:

Cause:

The SQL Server Add Azure Replica Wizard makes reference to a Virtual Network configured with the Affinity Group setting. As mentioned in the previous issue, the Affinity Group option is now deprecated in Microsoft Azure environments.

 

This issue will manifest itself with the following errors:

The wizard fails, and clicking the Error link reports a ‘geo-location constraint’ error like the one below:

[screenshot]

You can also look at the Add Azure Replica Wizard log (found in <Users>\<user name>\AppData\Local\SQL Server\AddReplicaWizard).

 

In this log you will see an error similar to the following:

Attempting to provision Windows Azure VM 'vmname' resulted in an error. (Microsoft.SQLServer.Management.HadrTasks)
Additional Information:
OperationID:1190164893f0120f97861e2cd5c47c8f, Status=Failed, Code=400, Details=The geo-location constraint specified for the hosted service is invalid.
(Microsoft.SQLServer.Management.HadrTasks)

Fix:

To resolve this issue, apply CU5 or a later CU / Service Pack for SQL Server 2014 prior to running the SQL Server Add Azure Replica Wizard. As an alternative, you can download a patch that was created prior to the release of CU5. For more information, click the link below:

FIX: Add Azure Replica wizard cannot enumerate Azure Virtual Network in SQL Server 2014

 

 

 

Issue(s):

Validation Error ‘Checking if the cluster name resource is online’ fails

The Add Replica to Availability Group wizard fails during group validation. It fails on:

“Checking if the cluster name resource is online”

[screenshot]

Error is – “Access is denied.”

[screenshot]

From Details you can see:

[screenshot]

Cause:

As part of the validation process, the Add Replica wizard tries to connect to the local Windows cluster and ensure that the cluster network name is online and available. When trying to perform this check, an error can result for several reasons. In the Fix section below we have listed the top reasons for failure.

Fix:

To resolve this issue verify the following items below:

1. The Cluster Network Name is not in an online state.

Verify it is online and not offline or failed by using the Failover Cluster Manager and looking at the Core Resources. Below is an Example where the Cluster Network Name resource is offline resulting in the error above.

[screenshot]

2. The currently logged-in user needs to have full control permissions on the cluster

To connect to and manage a Windows Failover cluster, there is an Access List that is maintained by the Windows Failover cluster. Ensure that the account you are using to launch the Add Azure Replica wizard has the correct permissions. You can view these permissions by looking at the Failover Clustering properties dialog box. Below is an Example:

[screenshot]

3. SSMS (SQL Server Management Studio) should be launched as Administrator

Even if the account does have administrative privileges to the Windows Failover Cluster, this error can still occur if you do not launch the SQL Server Management Studio as an Administrator.

To run SSMS as Administrator, just right-click it and select the option “Run as administrator”.

[screenshot]

Ensure that you see (Administrator) in the SSMS title bar.

[screenshot]

 

Issue:

SQL Add Azure Replica Wizard Fails on Step ‘Configuring Endpoints’ with error 53 (The network path was not found) and on Step ‘Provisioning Windows Azure VM’ with error 1722 (The RPC server is unavailable)

Some of the symptoms can be found below:

[screenshots]

Cause:

During provisioning of the Azure Virtual Machine the SQL Add Azure Replica Wizard passes the –JoinDomain option during creation of the Virtual Machine.

The Virtual Machine, however, was unable to join the domain because it could not resolve the domain name. By default, Azure uses its internal name resolution for all machines in your specific VNet. An on-premises domain, or a domain created in another VNet, will not resolve correctly and requires that you provide your own DNS servers. For more information:

Azure Name Resolution (DNS)

Fix:

Create a DNS entry for your current VNet or the VNet that will host the new VM created by the SQL Add Azure Replica Wizard. Below are some examples of how to do this.

Note: For this example, we have created a test-vpn-aar VNet as you can see below, and our on premises DNS server is 10.0.1.1

 

First Open the Manage.WindowsAzure.com portal and go to networks:

[screenshot]

After you find your Virtual Network, as we have done below, add a DNS Server entry to this one VNet. In this case we are adding the name OnPremDSN (the actual name does not matter, just use one you will recognize). However, the IP must be a valid address for one of your DNS servers. In our case it is 10.0.1.1; it will most likely be different in your environment.

[screenshots]

Click Save

[screenshot]

For more information:

Specifying a DNS Server in a Virtual Network Configuration File

 

Issue:

SQL Add Azure Replica Wizard fails with ‘The hosted service name is invalid’

[screenshots]

Cause:

The hosted service name, or in this case the cloud service name, cannot contain any ‘_’ (underscore) characters. The SQL Add Azure Replica Wizard attempts to create a unique cloud service or hosted service name by combining the following: AlwaysOn, the VNet name, the availability group name, and a GUID, separated by ‘-’ (dash) characters.

For example:

Virtual Network (Vnet) – Demo_Vnet

Availability Group – Demo_AG

Will result in trying to create a Cloud Service or Hosted Service with the name:

AlwaysOn-Demo_Vnet-Demo_AG-00000000000000000000000000000000.Cloudapp.Net

Fix:

There is no way to rename a virtual network or availability group. The only option will be to delete the virtual network or availability group and recreate without using the ‘_’ (underscore) character.

 

Issue:

SQL Add Azure Replica Wizard Fails on Step ‘Joining secondary replicas to availability group’

 

 

[screenshots]

Cause:

This error can occur if you have a listener defined for the Availability group prior to running the Add Azure Replica Wizard

Fix:

The only workaround for this issue is to create the SQL AlwaysOn listener after running the Add Azure Replica Wizard. If you have a listener already defined, remove it, re-run the Add Azure Replica Wizard, and then recreate the listener.

 

Issue:

VPN cannot be behind a NAT (Network Address Translation) device.

Cause:

Currently this is by design. Site to Site (S2S) is not supported when implemented behind a NAT Device (Home routers etc)

Fix:

In order to create a VPN between your on-premises network and Microsoft Azure, you must have a routing device, or Microsoft Windows running RRAS (Routing and Remote Access Service), directly connected to the internet.

 

 

 

 

 

Troubleshooting Tips

 

Disable Azure Virtual Machine Cleanup

When trying to troubleshoot failures with the Add Azure Replica Wizard, you should enable the “Disable Azure Virtual Machine Cleanup” behavior described below.

When the Add Azure Replica fails it automatically cleans up the added Azure virtual machine. This makes it very difficult to troubleshoot different types of failures etc. So for troubleshooting purposes ONLY, you can disable the automatic cleanup on failure by adding the following registry key.

NOTE:

The wizard sets the key back to 0 once it has been used to disable cleanup. That means that subsequent uses of the wizard will clean up the virtual machine automatically.

To disable the cleanup of the Azure environment, add the following registry key prior to running the Add Azure Replica Wizard.

HKEY_CURRENT_USER\Software\Microsoft\Microsoft SQL Server\120\Tools\Client\CreateAGWizard
Value Name: CleanupDisabled
Value Type: DWORD

Set it to 1 to disable cleanup.
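A minimal PowerShell sketch of creating this value (run as the user who will launch the wizard, since the key lives under HKEY_CURRENT_USER):

# Minimal sketch: create the CleanupDisabled DWORD described above and set it to 1.
$key = "HKCU:\Software\Microsoft\Microsoft SQL Server\120\Tools\Client\CreateAGWizard"
New-Item -Path $key -Force | Out-Null
New-ItemProperty -Path $key -Name "CleanupDisabled" -PropertyType DWord -Value 1 -Force | Out-Null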

Connect to the newly provisioned Azure Replica virtual machine

For troubleshooting purposes it may be necessary to connect to the newly provisioned Azure virtual machine. Once the virtual machine is reported in the Azure Virtual Machine portal as 'Running' you can connect to the virtual machine one of two ways:

Use Remote Desktop to Connect to the provisioned virtual machine via its IP address

From the server you launched the SQL Server Add Azure Replica wizard, try to connect to the Azure virtual machine using Remote Desktop. To acquire the virtual machine's IP address, use Azure Management Portal and bring up the Dashboard view for your Azure virtual machine.

Execute Remote Desktop, connecting with the vm's IP address:

mstsc /v:10.0.2.x

Where 10.0.2.x is the Azure virtual machine's IP address you acquired from the Azure Management portal for that virtual machine. You can also find the IP address in the Add Azure Replica log.

OR

Create a remote desktop endpoint on the Azure virtual machine

By default, the newly added Azure virtual machine has no Remote Desktop endpoint created. Create the endpoint using Azure Management Portal under the virtual machine’s Endpoints link.

Click Add and choose to Add a Stand-alone Endpoint. Then use Remote Desktop to make a connection to the Azure virtual machine from your host machine.

[screenshot]

Review the Add Azure Replica log for information on wizard failure.

Locate the Add Azure Replica Wizard log in <Users>\<user name>\AppData\Local\SQL Server\AddReplicaWizard on the on-premises server from which you launched and ran the Add Azure Replica Wizard.

Note:

Every time you run the Add Azure Replica Wizard, the logs will be overwritten. So it is important that you save them off as quickly as possible.

 

Known Issues / Fixes & Workarounds

ISSUE Virtual network drop down is empty - no Azure virtual networks are listed

SOLUTION Install the following hotfix:

FIX: Add Azure Replica wizard cannot enumerate Azure Virtual Network in SQL Server 2014

 

 

 

ISSUE Provisioning new virtual machine fails with error ‘The geo-location constraint…is invalid’ from Add Azure Replica:

The wizard fails, and clicking the Error link reports a ‘geo-location constraint’ error like the one below:

The Add Azure Replica Wizard log (found in <Users>\<user name>\AppData\Local\SQL Server\AddReplicaWizard) reports:

Attempting to provision Windows Azure VM 'vmname' resulted in an error. (Microsoft.SQLServer.Management.HadrTasks)
Additional Information:
OperationID:1190164893f0120f97861e2cd5c47c8f, Status=Failed, Code=400, Details=The geo-location constraint specified for the hosted service is invalid.
(Microsoft.SQLServer.Management.HadrTasks)

 SOLUTION Install the following hotfix:

FIX: Add Azure Replica wizard cannot enumerate Azure Virtual Network in SQL Server 2014

 

ISSUE - ‘Checking if the cluster name resource is online’ fails with Access Denied

The validation page may fail while checking the Cluster name resource:

[screenshot]

Clicking the Error link reports “Access is Denied.”

[screenshot]

SOLUTION Launch SQL Server Management Studio with elevated privileges.

[screenshot]

 

 

ISSUE Add Azure Replica Wizard Fails and reports ‘The RPC server is unavailable.’

The provisioned Azure virtual machine may not successfully join the domain, failing silently. If this occurs, while the wizard attempts to use WMI to configure ports on the virtual machine, clicking the Error links may report the following errors:

[screenshots]

From the Add Azure Replica Wizard log (found in <Users>\<user name>\AppData\Local\SQL Server\AddReplicaWizard):

2014-11-12T19:55:40.525        5344 SSMS_HadrTasks                 AlwaysOnWizard                           Information      Azure Replica VM Role:Provisioning
2014-11-12T19:56:10.529        5344 SSMS_HadrTasks                 AlwaysOnWizard                           Information      Checking Azure Replica VM Role state...
2014-11-12T19:56:12.099        5344 SSMS_HadrTasks                 AlwaysOnWizard                           Information      Azure Replica VM Role:ReadyRole
2014-11-12T19:56:12.100        5344 SSMS_HadrTasks                 AlwaysOnWizard                           Information      Retrieving virtual machine private IP address...
2014-11-12T19:56:12.555        5344 SSMS_HadrTasks                 AlwaysOnWizard                           Information      OperationId=2c307b5092792c08b8d2d15103925553, Status=Succeeded, Code=200, Details=OK
2014-11-12T19:56:12.555        5344 SSMS_HadrTasks                 AlwaysOnWizard                           Information      Retrieved virtual machine private IP address 10.0.16.5
2014-11-12T19:56:12.566        5344 SSMS_HadrTasks                 AlwaysOnWizard                           Information      Establishing remote PowerShell session. Host:10.0.16.5, Port:5986, Username:AZsqlnode\cmathews
2014-11-12T19:56:12.567        5344 SSMS_HadrTasks                 AlwaysOnWizard                           MethodEnter      AlwaysOnWizard.EnableFirewallPorts
2014-11-12T19:56:12.567        5344 SSMS_HadrTasks                 AlwaysOnWizard                           MethodEnter      AlwaysOnWizard.RetryOperation
2014-11-12T19:56:12.568        5344 SSMS_HadrTasks                 AlwaysOnWizard                           MethodEnter      AlwaysOnWizard.EnableFirewallCommand
2014-11-12T19:56:12.569        5344 SSMS_HadrTasks                 AlwaysOnWizard                           Information      netsh advfirewall firewall add rule name = SQLPort dir = in protocol = tcp action = allow localport = '135,49152-65535' remoteip = any profile = Domain
2014-11-12T19:56:38.890        5344 SSMS_HadrTasks                 AlwaysOnWizard                           Information      Ok.
2014-11-12T19:56:38.890        5344 SSMS_HadrTasks                 AlwaysOnWizard                           MethodExit       AlwaysOnWizard.EnableFirewallCommand [0ms]
2014-11-12T19:56:39.891        5344 SSMS_HadrTasks                 AlwaysOnWizard                           MethodExit       AlwaysOnWizard.EnableFirewallPorts [0ms]
2014-11-12T19:56:39.891        5344 SSMS_HadrTasks                 AlwaysOnWizard                           MethodEnter      AlwaysOnWizard.InitializeDataDisksManager
2014-11-12T19:56:39.891        5344 SSMS_HadrTasks                 AlwaysOnWizard                           MethodEnter      AlwaysOnWizard.RetryOperation
2014-11-12T19:56:39.895        5344 SSMS_HadrTasks                 AlwaysOnWizard                           MethodEnter      AlwaysOnWizard.WMIHelper.Connect
2014-11-12T19:56:39.895        5344 SSMS_HadrTasks                 AlwaysOnWizard                           Information      WMIHelper connection setup for remote connection:\\10.0.16.5\root\cimv2
2014-11-12T19:56:39.896        5344 SSMS_HadrTasks                 AlwaysOnWizard                           Information      WMIHelper establishing connection attempt #1
2014-11-12T19:57:05.507        5344 SSMS_HadrTasks                 AlwaysOnWizard                           Information      WMI connection failed due to the following error The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)
2014-11-12T19:57:15.509        5344 SSMS_HadrTasks                 AlwaysOnWizard                           Information      WMIHelper establishing connection attempt #2
2014-11-12T19:57:41.016        5344 SSMS_HadrTasks                 AlwaysOnWizard                           Information      WMI connection failed due to the following error The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)
2014-11-12T19:57:51.017        5344 SSMS_HadrTasks                 AlwaysOnWizard                           Information      WMIHelper establishing connection attempt #3

SOLUTION Add your on-premise DNS server IP address to your Azure virtual network

1 In your Azure Portal, click Networks in the left pane, and in the right pane, click your Azure virtual network name.

[screenshot]

2 Click the Configure link.

[screenshot]

3 Add your on-premise DNS server under ‘dns servers.’

[screenshot]

 

ISSUE Add Azure Replica fails with ‘The hosted service name is invalid’ if your Azure virtual network or on-premise availability group name has an '_' underscore character in it.

[screenshot]

In the below output (From the Add Azure Replica Wizard log found in <Users>\<user name>\AppData\Local\SQL Server\AddReplicaWizard), observe that the availability group is ‘ag’, and the Azure virtual network name is ‘VPN_virtual’ which results in the failure ‘The hosted service name is invalid.’ Here the DNS name is being set to the following which includes both the availability group name ‘ag’ and the virtual network name ‘VPN_virtual’: AlwaysOn-VPN_virtual-ag-c393bede3b5348298a2f55cc260563fd.cloudapp.net.

Since Azure does not support underscore in the DNS name this fails.

2014-11-05T15:11:50.507        3388 SSMS_HadrTasks                 AlwaysOnWizard                           MethodExit       AlwaysOnWizard.ValidatePageController:CheckValidationStepsExecutionStatus [0ms]
2014-11-05T15:11:52.150        3388 SSMS_HadrTasks                 AlwaysOnWizard                           Information      Availability Group: ag
...
2014-11-05T15:11:52.150        3388 SSMS_HadrTasks                 AlwaysOnWizard                           Information          Server instance name: 2012N5AZ
2014-11-05T15:11:52.150        3388 SSMS_HadrTasks                 AlwaysOnWizard                           Information              Virtual Machine Name: 2012N5AZ
2014-11-05T15:11:52.150        3388 SSMS_HadrTasks                 AlwaysOnWizard                           Information                  Image: SQL Server 2014 RTM Enterprise on Windows Server 2012 R2
2014-11-05T15:11:52.150        3388 SSMS_HadrTasks                 AlwaysOnWizard                           Information                  Location: East US
2014-11-05T15:11:52.150        3388 SSMS_HadrTasks                 AlwaysOnWizard                           Information                  Virtual Machine Size: Large (4 cores, 8 GB Memory)
2014-11-05T15:11:52.150        3388 SSMS_HadrTasks                 AlwaysOnWizard                           Information                  Virtual Network: VPN_virtual (East US)
2014-11-05T15:11:52.150        3388 SSMS_HadrTasks                 AlwaysOnWizard                           Information                  Virtual Network Subnet: Subnet-1(10.0.2.1/25)
2014-11-05T15:11:52.150        3388 SSMS_HadrTasks                 AlwaysOnWizard                           Information                  Cluster Name: PrimaryCluster.SQLTEST.EDU
2014-11-05T15:11:52.150        3388 SSMS_HadrTasks                 AlwaysOnWizard                           Information                  Windows Azure Hosted Service (DNS name): AlwaysOn-VPN_virtual-ag-c393bede3b5348298a2f55cc260563fd.cloudapp.net
...
2014-11-05T15:11:57.372        3388 SSMS_HadrTasks                 AlwaysOnWizard                           Information      Work item Provisioning Windows Azure VM '2012N5AZ'. started
2014-11-05T15:11:57.388        3388 SSMS_HadrTasks                 AlwaysOnWizard                           MethodEnter      AlwaysOnWizard.CreateHostedService
2014-11-05T15:11:57.845        3388 SSMS_HadrTasks                 AlwaysOnWizard                           Information      OperationId=07a5f1f8e6fb27f482ed47e8b49ae136, Status=Failed, Code=400, Details=The hosted service name is invalid.
2014-11-05T15:11:57.845        3388 SSMS_HadrTasks                 AlwaysOnWizard                           CriticalError    [*] OperationId=07a5f1f8e6fb27f482ed47e8b49ae136, Status=Failed, Code=400, Details=The hosted service name is invalid.

SOLUTION Do not use underscore (‘_’) in the name of your availability group or Azure virtual network name.

 

ISSUE Add Azure Replica fails during the Join operation if the availability group has a pre-defined listener.

If you run the Add Azure Replica wizard when your on premise availability group has a listener already defined, and that listener has only been created in the context of the on premise network, the Add Azure Replica Wizard may fail during the join operation.

The following appears in the Add Azure Replica Wizard log when the join fails:

2014-12-23T13:23:13.666        3036 SSMS_HadrTasks                 AlwaysOnWizard                           Information      Work item Adding secondary replicas to availability group 'Cluster02AG'. started
2014-12-23T13:23:14.053        3036 SSMS_HadrTasks                 AlwaysOnWizard                           Information      Work item Adding secondary replicas to availability group 'Cluster02AG'. stopped
2014-12-23T13:23:14.053        3036 SSMS_HadrTasks                 AlwaysOnWizard                           Information      Work Item:Adding secondary replicas to availability group 'Cluster02AG'., Details:Completed!
2014-12-23T13:23:14.073        3036 SSMS_HadrTasks                 AlwaysOnWizard                           Information      Work item Joining secondary replicas to availability group 'Cluster02AG'. started
2014-12-23T13:23:14.773        3036 SSMS_HadrTasks                 AlwaysOnWizard                           Information      Work item Joining secondary replicas to availability group 'Cluster02AG'. stopped
2014-12-23T13:23:14.773        3036 SSMS_HadrTasks                 AlwaysOnWizard                           Error            [*] Work Item:Joining secondary replicas to availability group 'Cluster02AG'., Details:Joining secondary replica to availability group resulted in an error.

 

 

 

The Add Azure Replica warns you in the Specify Replica dialog box that the listener will ‘not be configured.’

[screenshot]

SOLUTION Temporarily remove the availability group listener to run the Add Azure Replica wizard.

 

ISSUE VPN Cannot be behind a NAT (network address translation) device

In order for this to work your RRAS server has to be directly connected to the Internet. You cannot have it behind a NAT’d connection. For more information see:

Tutorial: Create a Cross-Premises Virtual Network for Site-to-Site Connectivity

The VPN device cannot be located behind a network address translator (NAT) and must meet the minimum device standards. See About VPN Devices for Virtual Network for more information. In this case your RRAS Server is your VPN Device

This link below is how to setup the Gateway in an environment similar to what you are trying to do.

Connect an On-premises Network to Azure via Site to Site VPN and Extend Active Directory onto an IaaS VM DC in Azure

 

Troubleshooting Tips

Disable Azure Virtual Machine Cleanup When the Add Azure Replica fails it automatically cleans up the added Azure virtual machine. For troubleshooting purposes, you can disable cleanup by adding the following registry key. NOTE the wizard sets the key back to 0 once it has been used to disable cleanup. That means that subsequent uses of the wizard will clean up the virtual machine automatically.

HKEY_CURRENT_USER\Software\Microsoft\Microsoft SQL Server\120\Tools\Client\CreateAGWizard
Value Name: CleanupDisabled
Value Type: DWORD


Set it to 1 to disable cleanup.

Connect to the newly provisioned Azure Replica virtual machine For troubleshooting purposes it may be necessary to connect to the newly provisioned Azure virtual machine. Once the virtual machine is reported in the Azure Virtual Machine portal as 'Running' you can connect to the virtual machine one of two ways:

Use Remote Desktop to Connect to the provisioned virtual machine via its IP address From the server you launched the SQL Server Add Azure Replica wizard, try to connect to the Azure virtual machine using Remote Desktop. To acquire the virtual machine's IP address, use Azure Management Portal and bring up the Dashboard view for your Azure virtual machine.

Execute Remote Desktop, connecting with the vm's IP address:

mstsc /v:10.0.2.x

Where 10.0.2.x is the Azure virtual machine's IP address you acquired from the Azure Management portal for that virtual machine. You can also find the IP address in the Add Azure Replica log.

OR

Create a remote desktop endpoint on the Azure virtual machine By default, the newly added Azure virtual machine has no Remote Desktop endpoint created. Create the endpoint using Azure Management Portal under the virtual machine’s Endpoints link. Click Add and choose to Add a Stand-alone Endpoint. Then use Remote Desktop to make a connection to the Azure virtual machine from your host machine.

[screenshot]

Review the Add Azure Replica log for information on wizard failure. Locate the Add Azure Replica Wizard log in <Users>\<user name>\AppData\Local\SQL Server\AddReplicaWizard on the on premise server you launched and ran the Add Azure Replica Wizard on.

 

Additional Information and Guidelines

For more information and guidelines on using the SQL Server 2014 Add Azure Replica wizard, see:

Use the Add Azure Replica Wizard (SQL Server)

For a tutorial of the Add Azure Replica wizard, that includes a walk-through using screenshots of each dialog box:

Tutorial: Add Azure Replica Wizard

Large Transaction Interrupted by Failover, Secondary Database Reports REVERTING


If a large transaction in an availability database on the primary replica is interrupted by a failover of the availability group, once failover has occurred, the database on your secondary (the old primary replica) may report a NOT SYNCHRONIZING or REVERTING state for a long period of time. For example, consider a primary replica hosted on SQLNODE1 and a secondary replica on SQLNODE2, and failover occurs while a large ALTER INDEX transaction is running.

After failover has occurred, the secondary is reported as RECOVERING with a REVERTING synchronization state, and access to the readable database during this time will fail with message 922:

[Screenshot: QuerySecInReverting]

The large transaction rollback and secondary recovery events must complete before the readable secondary resumes synchronization and receiving changes from the new primary replica. The following lists three 'phases' that can be tracked to monitor the secondary database as it progresses towards a  SYNCHRONIZING/SYNCHRONIZED state.

Rollback the large transaction

In our example, the ALTER INDEX transaction must roll back. Even though failover has succeeded, the large transaction is still rolling back on the secondary (SQLNODE1). This progress can be monitored in the SQL Server error log on the secondary (SQLNODE1); the message 'Remote harden of transaction 'ALTER INDEX'...failed.' marks the completion of the rollback:

2014-12-10 07:33:47.70 spid39s     Nonqualified transactions are being rolled back in database agdb for an AlwaysOn Availability Groups state change. Estimated rollback completion: 0%. This is an informational message only. No user action is required.
...
2014-12-10 07:35:12.77 spid39s     Nonqualified transactions are being rolled back in database agdb for an AlwaysOn Availability Groups state change. Estimated rollback completion: 0%. This is an informational message only. No user action is required.
2014-12-10 07:35:14.77 spid39s     Nonqualified transactions are being rolled back in database agdb for an AlwaysOn Availability Groups state change. Estimated rollback completion: 0%. This is an informational message only. No user action is required.
2014-12-10 07:35:15.64 spid52      Remote harden of transaction 'ALTER INDEX' (ID 0x000000000000447e 0000:000a40f4) started at Dec 10 2014  7:27AM in database 'agdb' at LSN (464:78600:3) failed.
2014-12-10 07:35:15.64 spid39s     Nonqualified transactions are being rolled back in database agdb for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.
2014-12-10 07:35:15.96 spid36s     The state of the local availability replica in availability group 'ag' has changed from 'RESOLVING_NORMAL' to 'SECONDARY_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

Here is the recovery state as reported by sys.dm_hadr_database_replica_states during rollback of the large transaction:

[Screenshot: QuerySecInRollback]

Transition database to SECONDARY and recover database

Once the transaction has rolled back, the availability database transitions to the SECONDARY role and recovery begins - there is a large transaction that must be rolled back:

2014-12-10 07:35:15.97 spid39s     The availability group database "agdb" is changing roles from "RESOLVING" to "SECONDARY" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
2014-12-10 07:35:15.97 spid23s     Nonqualified transactions are being rolled back in database agdb for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.
2014-12-10 07:35:15.97 spid34s     A connection for availability group 'ag' from availability replica 'SQLNODE1' with id  [203436EF-1D14-4F94-A9B4-8F3F25042328] to 'SQLNODE2' with id [DF629A4A-7CF9-4C5E-BCF3-3F592766F61E] has been successfully established.  This is an informational message only. No user action is required.
2014-12-10 07:35:18.34 spid23s     Starting up database 'agdb'.
2014-12-10 07:35:41.22 spid23s     Recovery of database 'agdb' (16) is 0% complete (approximately 50 seconds remain). Phase 1 of 3. This is an informational message only. No user action is required.
...
2014-12-10 07:35:42.82 spid23s     Recovery of database 'agdb' (16) is 3% complete (approximately 46 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2014-12-10 07:35:44.83 spid23s     Recovery of database 'agdb' (16) is 7% complete (approximately 44 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
...
2014-12-10 07:36:11.61 spid23s     Recovery of database 'agdb' (16) is 28% complete (approximately 76 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
...
2014-12-10 07:39:00.90 spid23s     Recovery of database 'agdb' (16) is 99% complete (approximately 1 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2014-12-10 07:39:00.96 spid23s     AlwaysOn Availability Groups connection with primary database established for secondary database 'agdb' on the availability replica with Replica ID: {df629a4a-7cf9-4c5e-bcf3-3f592766f61e}. This is an informational message only. No user action is required.
2014-12-10 07:39:00.96 spid23s     The recovery LSN (464:71760:1) was identified for the database with ID 16. This is an informational message only. No user action is required.
2014-12-10 07:45:16.78 spid23s     Error: 35278, Severity: 17, State: 1.
2014-12-10 07:45:16.78 spid23s     Availability database 'agdb', which is in the secondary role, is being restarted to resynchronize with the current primary database. This is an informational message only. No user action is required.

Here is the recovery state as reported by sys.dm_hadr_database_replica_states during recovery of the large transaction on the secondary:

[Screenshot: sys.dm_hadr_database_replica_states output during recovery of the large transaction on the secondary]

Reviewing the performance monitor chart below, the rollback and recovery phases described above are represented by the red line, the 'SQL Server: Database Replica: Recovery Queue' counter measured on the new primary, SQLNODE2:

[Performance monitor chart: Recovery Queue (red, SQLNODE2) and Log remaining for undo (blue, SQLNODE1)]

Secondary Database in REVERTING state

After the rollback of the large transaction and the recovery of the availability database at the secondary replica, the availability database at SQLNODE1 is still not accessible and is not yet receiving new changes from the primary. At this point, querying sys.dm_hadr_database_replica_states on SQLNODE1 reports the synchronization state description as 'REVERTING.' This is part of the UNDO phase of recovery, during which the secondary must actively request and receive pages from the primary before it can complete.

[Screenshot: sys.dm_hadr_database_replica_states output showing the REVERTING synchronization state]
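For reference, the state can be checked directly with a short query against the DMV; this is a minimal sketch run on the secondary replica (SQLNODE1):

-- During the UNDO phase, synchronization_state_desc reports REVERTING
-- for the local copy of the availability database.
SELECT db_name(database_id)      AS database_name,
       database_state_desc,
       synchronization_state_desc
FROM sys.dm_hadr_database_replica_states
WHERE is_local = 1;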

The REVERTING phase can take a long time to complete. The good news is that we can track its progress in performance monitor using the 'SQL Server: Database Replica: Log remaining for undo' counter on the secondary (SQLNODE1). We can see an example of this in the performance monitor chart above (click to enlarge): the blue line tracks 'SQL Server: Database Replica: Log remaining for undo' on SQLNODE1. Once transaction rollback and database recovery complete (the red line, monitored with 'SQL Server: Database Replica: Recovery Queue' on the primary), log remaining for undo rises and then begins to drain.
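The same counters can also be read with T-SQL from sys.dm_os_performance_counters. This is a minimal sketch; the LIKE patterns are intentionally loose because the counter object name includes the instance name:

-- Run on the secondary (SQLNODE1) to watch 'Log remaining for undo', or on the
-- new primary (SQLNODE2) to watch 'Recovery Queue'; instance_name is the database.
SELECT object_name, counter_name, instance_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Database Replica%'
  AND (counter_name LIKE 'Recovery Queue%'
       OR counter_name LIKE 'Log remaining for undo%');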

In conclusion, secondary replica synchronization can be delayed following the interruption of a large transaction at the primary replica. Using the SQL Server error log and performance monitor, however, we can track its progress and gain some assurance that it is moving toward completion, so that the secondary is once again providing redundancy for the primary replica and read access for reports, backups, and so on.

Diagnose Unexpected Failover or Availability Group in RESOLVING State


AlwaysOn availability groups use Windows Server Failover Clustering to 1) detect the health of the SQL Server process that hosts the primary replica, and 2) fail over the availability group resource, if configured to do so. When a health issue is detected, the availability group transitions from the PRIMARY role to the RESOLVING role and, if configured for automatic failover, transitions back out of RESOLVING to the PRIMARY role on the automatic failover partner. While the availability group is in the RESOLVING role, the availability databases cannot be accessed by your application.

When your availability group replica transitions to RESOLVING, this is effectively an outage in your production environment. The purpose of this blog is to help you diagnose the more common health issues detected by AlwaysOn so you can avoid their impact in the future. Each scenario below describes a common health issue, how to investigate its root cause, and mitigation steps.

Diagnose Lease Timeout (Triggered on FAILURE_CONDITION_LEVEL 1 - 5)

The lease is a heartbeat that detects the health of the SQL Server process hosting the primary replica. A thread runs at an elevated priority inside SQL Server and communicates via a Windows event with the SQL Server resource DLL hosted in the RHS.EXE process. If that thread does not respond within the lease timeout period, the SQL Server resource DLL reports a lease timeout, and the availability group resource transitions to the RESOLVING state and fails over if configured to do so.

If SQL Server cannot respond within the default 20-second lease timeout period, a lease timeout is triggered. The following are the most common causes of a lease timeout in SQL Server, along with how to determine the root cause and mitigate the impact to your production environment.

The lease health check cannot be disabled; it is active for all FAILURE_CONDITION_LEVEL settings (1-5).

For more information on lease timeout, see

How It Works: SQL Server AlwaysOn Lease Timeout

Lease Timeout CAUSE - SQL Server dump diagnostic

SQL Server may detect an internal health issue, such as an access violation or deadlocked schedulers, and responds by producing a mini dump file of the SQL Server process for diagnostic purposes. This 'freezes' SQL Server for a number of seconds while the mini dump is written to disk. During this time, all threads within the SQL Server process are in a 'frozen' state, including the lease thread being monitored by the SQL Server resource DLL, resulting in a lease timeout.

Review the SQL Server error log

Review the SQL Server error log for message 19407, which is the lease timeout message. Once you find the 19407 message, look immediately prior to the 19407 in the error log for any indication of an event that resulted in SQL Server producing a dump file. In the example output from a SQL Server error log below, SQL Server produced a dump because of a detected scheduler deadlock. During the period in which the dump file was created, SQL Server did not respond to the lease health mechanism and a lease timeout was triggered immediately following the dump diagnostic:

2014-11-02 21:21:10.59 Server      **Dump thread - spid = 0, EC = 0x0000000000000000
2014-11-02 21:21:10.59 Server      ***Stack Dump being sent to C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\LOG\SQLDump0001.txt
2014-11-02 21:21:10.59 Server      * *******************************************************************************
2014-11-02 21:21:10.59 Server      *
2014-11-02 21:21:10.59 Server      * BEGIN STACK DUMP:
2014-11-02 21:21:10.59 Server      *   11/02/14 21:21:10 spid 1920
2014-11-02 21:21:10.59 Server      *
2014-11-02 21:21:10.59 Server      * Deadlocked Schedulers
2014-11-02 21:21:10.59 Server      *
2014-11-02 21:21:10.59 Server      * *******************************************************************************
2014-11-02 21:21:10.59 Server      * -------------------------------------------------------------------------------
2014-11-02 21:21:10.59 Server      * Short Stack Dump
2014-11-02 21:21:10.76 Server      Stack Signature for the dump is 0x00000000000002BA
2014-11-02 21:21:19.56 Server      Error: 19407, Severity: 16, State: 1.
2014-11-02 21:21:19.56 Server      The lease between availability group 'ag' and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.
2014-11-02 21:21:19.56 Server      AlwaysOn: The local replica of availability group 'SQLNODE1' is going offline because either the lease expired or lease renewal failed. This is an informational message only. No user action is required.
2014-11-02 21:21:19.56 Server      The state of the local availability replica in availability group 'ag' has changed from 'PRIMARY_NORMAL' to 'RESOLVING_NORMAL'.  The state changed because the lease between the local availability replica and Windows Server Failover Clustering (WSFC) has expired.  For more information, see the SQL Server error log, Windows Server Failover Clustering (WSFC) management console, or WSFC log.

Lease Timeout CAUSE - 100% CPU utilization

If CPU utilization is very high during a period of time, this can result in a lease timeout. Monitor for high CPU utilization using Performance Monitor.

  1. Launch Performance Monitor on the server hosting the primary replica of the SQL Server availability group.
  2. In the left pane, under Data Collector Sets right click 'User Defined' and then New and then Data Collector Set.
  3. Give the data collector a name and specify 'Create from a template' and click Next.
  4. Choose System Diagnostics template and click Finish.
  5. Under the User Defined data sets in the left pane, right-click the new data collector and choose Start.
  6. Leave data collection running until the lease timeout event re-occurs, then stop the data collector, open the log in Performance Monitor and review the Processor / % Processor Time counter to see if sustained CPU utilization is detected during the time of the lease timeout.

To get an exact time for the lease timeout, review the SQL Server error log, searching for the following message:

2014-11-02 21:21:19.56 Server      The lease between availability group 'ag' and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster. 
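If you prefer to search the error log with T-SQL rather than scrolling through it, the undocumented xp_readerrorlog procedure can filter for the lease message; a minimal sketch (the search strings are only examples):

-- 0 = current log, 1 = SQL Server error log (2 would be the SQL Agent log);
-- the two search strings are combined with AND to narrow the results.
EXEC xp_readerrorlog 0, 1, N'lease', N'expired';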

Mitigation

Temporarily increase the availability group lease timeout property while researching and resolving the underlying issue (the condition reported by the SQL Server dump diagnostic, or the source of the sustained CPU utilization).

  1. Launch Failover Cluster Manager.
  2. In the left pane, click Roles.
  3. In the middle pane, right-click the availability group clustered resource and click Properties.
  4. Click the Properties tab and increase the LeaseTimeout in milliseconds. The maximum value is 100,000 ms.

[Screenshot: availability group resource Properties dialog, adjusting the LeaseTimeout property]

Diagnose Internal SQL Server health  (FAILURE_CONDITION_LEVEL 3 - 5)

AlwaysOn health detection introduces new rich health monitoring that takes into account the internal health state of SQL Server.

Internally, SQL Server executes sp_server_diagnostics, which returns a rich diagnostic data set reporting on SQL Server health components such as memory, scheduler, and so on. If an error state is returned for one of these health components, the SQL Server resource DLL can respond by transitioning the availability group to the RESOLVING role and failing over if configured to do so.
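You can also execute the procedure yourself to see the same component rows that health detection evaluates; a quick sketch:

-- Returns one row per health component (system, resource, query_processing,
-- io_subsystem, events) with its current state and diagnostic data.
EXEC sp_server_diagnostics;

-- Or stream the results every 5 seconds until the query is cancelled:
-- EXEC sp_server_diagnostics @repeat_interval = 5;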

The sp_server_diagnostics results are stored in the cluster diagnostic log files in the SQL Server \LOG directory, with file names of the form SRVNAME_SQLINSTANCENAME_SQLDIAG_XXX.XEL. The cluster diagnostic log contents can be viewed and filtered by opening the files in SQL Server Management Studio.
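The same files can also be read with T-SQL. The following is a minimal sketch; the path is an assumption, so substitute your instance's \LOG folder:

-- Reads every cluster diagnostic log matching the pattern and returns the raw
-- component_health_result events; inspect event_xml for components whose
-- state_desc field reports 'error'.
SELECT x.file_name,
       x.object_name             AS event_name,
       CAST(x.event_data AS XML) AS event_xml
FROM sys.fn_xe_file_target_read_file(
         N'C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\LOG\*_SQLDIAG_*.xel',
         NULL, NULL, NULL) AS x
WHERE x.object_name = N'component_health_result';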

If an availability group transitioned to the RESOLVING role because an error state was returned for one of these health components, review the component_health_result events in the cluster diagnostic log for an ERROR state. By looking for these ERROR state events, we can determine which health issue in SQL Server triggered the availability group to transition to the RESOLVING state and even fail over, if configured to do so.

The sp_server_diagnostics results are only utilized to monitor health when FAILURE_CONDITION_LEVEL is set to 3-5. For more information on the FAILURE_CONDITION_LEVEL settings and which will trigger availability group state change, see:

Flexible Failover Policy for Automatic Failover of an Availability Group (SQL Server)
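For reference, the configured level can be checked in sys.availability_groups and changed with ALTER AVAILABILITY GROUP. The sketch below assumes the availability group from the earlier examples, named 'ag':

-- Check the failure condition level configured for each availability group.
SELECT name, failure_condition_level
FROM sys.availability_groups;

-- Raise the level so that additional sp_server_diagnostics error states
-- (for example, deadlocked schedulers at level 5) trigger a response.
ALTER AVAILABILITY GROUP [ag] SET (FAILURE_CONDITION_LEVEL = 5);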

Review Cluster Diagnostic Log (SRVNAME_SQLINSTANCE_SQLDIAG_XXX.XEL)

Launch SQL Server Management Studio and open all the cluster diagnostic logs in the SQL Server \LOG subdirectory using the 'Merge Extended Event Files' feature:

  1. Launch SQL Server Management Studio.
  2. From the File menu choose Open and then Merge Extended Event Files. Click the Add button in the dialog that appears.
  3. Drill into the SQL Server \LOG folder and shift+click to select each of the cluster diagnostic logs and click Open.

[Screenshot: File > Open > Merge Extended Event Files dialog]

Filter for component_health_result events that report ERROR state description

When certain component_health_result events report an error state, your availability group may transition to RESOLVING or automatically fail over. To check for the error state in the cluster diagnostic logs:

  1. From the Extended Events menu choose Filters.
  2. In the Field column choose state_desc.
  3. Leave the Operator as '=' and for the value enter ‘error’ without the quotes.
  4. Click Ok.

EXAMPLE 1 Review the cluster diagnostic logs

Consider a scenario where the availability group FAILURE_CONDITION_LEVEL is set to 3 and a health issue transitions the availability group from PRIMARY to RESOLVING. To diagnose, open the cluster diagnostic logs on the primary replica; by filtering we find a system component_health_result event reporting error. Now that we know the time frame, we remove the filter and also see an availability_group_is_alive_failure event, indicating that the availability group transitioned to the RESOLVING role and may have initiated failover if configured to do so.

[Screenshot: cluster diagnostic log showing the system component_health_result error and the availability_group_is_alive_failure event]

What SQL Server health issue caused the error state on the system component_health_result event? Double-click the data element in the Details pane; this launches a new window in SSMS with the raw XML health data. We can see that totalDumpRequests is an unusually large 118 dumps, and AlwaysOn health detection treats dump volume as a health metric.

[Screenshot: raw XML health data showing totalDumpRequests = 118]

EXAMPLE 2 Review the cluster diagnostic logs

In the next example, a serious health issue occurs on the primary (a scheduler deadlock), but the availability group does not transition to RESOLVING and never fails over. Opening the cluster diagnostic logs, we find a query_processing component_health_result event reporting a state of error. Looking at the data field in the Details pane, we see HasDeadlockedSchedulersOccurred=1.

Here we do not observe a corresponding availability_group_is_alive_failure event. Why? The availability group FAILURE_CONDITION_LEVEL is set to 3, but AlwaysOn health detection will not trigger a response for this type of health issue unless the FAILURE_CONDITION_LEVEL for the availability group is set to 5.

[Screenshot: query_processing component_health_result event with HasDeadlockedSchedulersOccurred=1]

For more information on the different SQL Server internal health states that the FAILURE_CONDITION_LEVEL property can be configured for, see the MSDN content:

Flexible Failover Policy for Automatic Failover of an Availability Group (SQL Server)

Mitigation

Address the SQL Server health issue identified using the cluster diagnostic logs.

Diagnose SQL Server does not respond within HEALTH_CHECK_TIMEOUT (FAILURE_CONDITION_LEVEL 2 - 5)

AlwaysOn health detection uses the availability group HEALTH_CHECK_TIMEOUT to define how long to wait for a response from SQL Server. If SQL Server does not respond with the results of executing sp_server_diagnostics within the HEALTH_CHECK_TIMEOUT (the default is 30 seconds), the availability group transitions to the RESOLVING state and fails over if configured to do so.

The availability group HEALTH_CHECK_TIMEOUT setting is only utilized to trigger a response when FAILURE_CONDITION_LEVEL is set to 2-5. For more information on the FAILURE_CONDITION_LEVEL settings and which will trigger availability group state change, see:

Flexible Failover Policy for Automatic Failover of an Availability Group (SQL Server)

Example Review Cluster log

To check if a health check timeout has been violated, review the cluster log. Launch PowerShell with elevated privileges and execute 'Get-ClusterLog.'

[Screenshot: Get-ClusterLog executed in an elevated PowerShell window]

Note the '...is not healthy with given HealthCheckTimeout' error message.

[Screenshot: cluster log entry reporting '...is not healthy with given HealthCheckTimeout']

If the cluster log is no longer available, this is also logged in the cluster diagnostic logs (SRVNAME_SQLINSTANCE_SQLDIAG_XXX.XEL) located in the SQL Server \LOG subdirectory.

  1. Launch SQL Server Management Studio.
  2. From the File menu choose Open and then Merge Extended Event Files. Click the Add button in the dialog that appears.
  3. Drill into the SQL Server \LOG folder and shift+click to select each of the cluster diagnostic logs and click Open.
  4. Once the events have completed loading, click the timestamp column to order the events ascending by time.
  5. Locate the time that the availability group transitioned to the RESOLVING state or failed over, and look for 'availability_group_is_alive_failure' events. Then locate any 'info_message' events nearby and review the Data column of each for additional explanation. Finding one that states 'Availability group is not healthy given HealthCheckTimeout and FailureConditionLevel' suggests the failure could be caused by the HealthCheckTimeout.
  6. Look at the additional info_message events. One that reports QueryTimeoutExpired confirms that the HEALTH_CHECK_TIMEOUT was violated.

 

[Screenshots: cluster diagnostic log info_message events showing the HealthCheckTimeout explanation and QueryTimeoutExpired]

Mitigation

For temporary relief, increase the availability group HEALTH_CHECK_TIMEOUT value. Then correct the health issue in SQL Server, or in the system, that is impacting its responsiveness to the health check connection.


ALTER AVAILABILITY GROUP AG1 SET (HEALTH_CHECK_TIMEOUT = 40000);
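To verify the change, the configured value (in milliseconds) is exposed in sys.availability_groups:

SELECT name, health_check_timeout
FROM sys.availability_groups;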

 

Troubleshooting REDO queue build-up (data latency issues) on AlwaysOn Readable Secondary Replicas using the WAIT_INFO Extended Event


PROBLEM

You have confirmed an excessive build-up of the REDO queue on an AlwaysOn Availability Group secondary replica, for example by querying the redo_queue_size column of sys.dm_hadr_database_replica_states or by monitoring the 'SQL Server: Database Replica: Recovery Queue' performance counter.
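For reference, a minimal query to check the REDO queue on the secondary replica:

-- Run on the secondary replica; redo_queue_size is log (in KB) received but not
-- yet redone, and redo_rate is the rate (KB/sec) at which it is being redone.
SELECT db_name(database_id) AS database_name,
       redo_queue_size,
       redo_rate
FROM sys.dm_hadr_database_replica_states
WHERE is_local = 1;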

You may also observe some of the following symptoms:

  • Excessive transaction log file growth on all replicas (primary & secondary)
  • Data is “missing” when querying a readable secondary replica database.
  • Long failover times

 

CAUSE

There are several potential causes for REDO queue build-up or observed data latencies when querying a readable secondary replica database.  These include (but may not be limited to):

  • Long-running active transactions
  • High network latency / low network throughput
  • Blocked REDO thread
  • Shared REDO Target

The following two articles provide excellent explanations and guidance for each of those causes:

 

This article focuses on another cause - resource contention.   In this context, resource contention means that the REDO thread is waiting on one or more resources due to other activity on the secondary replica.  These resources could be a lock (REDO blocking), disk I/O, RAM or even CPU time.

This article provides a method for using the WAIT_INFO extended event to identify the main resources on which the REDO thread is waiting. By following the techniques described here, you can determine more specifically where to focus when investigating contention issues. This article is not intended to be a comprehensive discussion of SQL Server extended events, but rather shows how the WAIT_INFO XEvent can be used to help determine why the REDO thread may be falling behind. For general information on extended events, please refer to the references section at the end of this article.

 

WAIT_INFO event – what is it?

The WAIT_INFO XEvent logs when a non-preemptive worker thread in SQL Server enters or leaves a "wait state". (Note: pre-emptive threads have their own event, WAIT_INFO_EXTERNAL.) If a thread needs to wait for an I/O to complete, for example, the WAIT_INFO XEvent fires when the thread begins waiting on the I/O and then again when the I/O completes and the thread resumes.

The default fields included in the WAIT_INFO event are:

 

  • wait_type - Value representing the type of wait.
  • opcode - Value representing BEGIN (0) or END (1), depending on whether the thread is entering or leaving the wait.
  • duration - The total amount of time (in milliseconds) the thread was waiting before it resumed.
  • signal_duration - The time (in milliseconds) between the thread being signaled that the wait condition is over and the time the thread actually resumes executing. In other words, the time the thread spent on the scheduler's "runnable" list waiting to get back on the CPU and run again.

 

 

How can I use it?

By capturing the WAIT_INFO event when the thread is leaving the wait (opcode = 1), the duration indicates exactly how long the thread waited on that particular wait type. Grouping by wait_type and aggregating the duration (and signal_duration) values provides a breakdown of which wait types were encountered by the thread, how many times each wait was encountered, and how long the thread waited in each wait type. In essence, it provides information similar to sys.dm_os_wait_stats, in real time, for a specific thread or session_id.

 

How to configure?

On the SQL Server hosting the secondary replica of a given availability group, each database has a single REDO thread with its own session_id.  To get a list of all of the redo threads on a given secondary replica instance (across all availability groups), issue the following query which will return the session_ids performing “REDO” -- otherwise known as  “DB STARTUP”.

 

SELECT db_name(database_id) AS DBName, session_id
FROM sys.dm_exec_requests
WHERE command = 'DB STARTUP'

 

[Screenshot: query results listing the databases and session_ids performing REDO (DB STARTUP)]

 

Once we have the session_id(s) to monitor, we can then configure an extended event session to capture the WAIT_INFO events for only the session_id(s) of interest.   This is important because the WAIT_INFO event can be fired quite extensively in a busy system.   Even when limiting to REDO specific session_ids, it can generate a lot of data very quickly.

Note: for a given secondary replica database, if the REDO thread is completely caught up and there is no new activity on the primary, the thread will eventually be closed and returned to the HADR thread pool, so it is possible you may not see an active session_id for a given database, or the session_id may change for a given database. However, in busy systems, where new transactions are constantly taking place on the primary, the REDO thread will remain active and keep the same session_id for extended periods of time.

 

The following script will create an extended event session that collects WAIT_INFO events for session_id 41 when they are leaving the wait (opcode = 1).   In addition to the default data collected by the WAIT_INFO event itself, additional actions are included to collect:

 

  • event_sequence - To get the exact sequence of events, in case order is important in further investigation.
  • session_id - In case more than one REDO thread is being monitored, so it is possible to know which event belongs to which session.
  • database_id - To validate that the session_id is still performing REDO on the database of interest (in case REDO threads were closed due to inactivity and restarted later).
  • scheduler_id - For further investigation if there is evidence of scheduler contention, so that you know on which scheduler the REDO thread is executing.

 

CREATE EVENT SESSION [redo_wait_info] ON SERVER 
ADD EVENT sqlos.wait_info(
    ACTION(package0.event_sequence,
        sqlos.scheduler_id,
        sqlserver.database_id,
        sqlserver.session_id)
    WHERE ([opcode]=(1) AND 
        [sqlserver].[session_id]=(41))) 
ADD TARGET package0.event_file(
    SET filename=N'redo_wait_info')
WITH (MAX_MEMORY=4096 KB,
    EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,
    MAX_DISPATCH_LATENCY=30 SECONDS,
    MAX_EVENT_SIZE=0 KB,
    MEMORY_PARTITION_MODE=NONE,
    TRACK_CAUSALITY=OFF,STARTUP_STATE=OFF)
GO

NOTE:  By default, the above session definition will create the redo_wait_info.XEL file in your SQL Server Instance \LOG directory.  Be sure there is sufficient disk space available on this drive.

 

Once you have the extended event session defined, start it using the SQL Server Management Studio (SSMS) GUI or TSQL, collect for a short period of time, and then stop the collection. It is advisable that the first time a collection is done on a busy system, the collection period be no more than 30 to 60 seconds. While extended events are lightweight, it is still possible to negatively affect production and generate a large amount of data very quickly. Once you have determined the impact on your system, collections over longer intervals can be attempted.

The following TSQL will start the extended event session, wait for 30 seconds and then stop the session:

 

ALTER EVENT SESSION [redo_wait_info] ON SERVER STATE = START
WAITFOR DELAY '00:00:30'
ALTER EVENT SESSION [redo_wait_info] ON SERVER STATE = STOP
  

 

Open and Aggregate Data for Review

This section describes how to open, group, and then aggregate the collected data to produce a concise presentation that facilitates quicker analysis.

In SSMS, open the session and then click on one of the WAIT_INFO events so that the data elements can be seen in the Details pane. Right-click each of the highlighted fields and choose "Show Column in Table".

 

[Screenshot: Details pane, right-click a field and choose Show Column in Table]

 

The details now appear as columns in the table of events:

 

[Screenshot: event grid with the chosen fields displayed as columns]

 

Click the Grouping button to begin grouping the events – first by session_id and then by wait_type:

 

[Screenshot: Grouping dialog]

 

[Screenshot: events grouped by session_id and then wait_type]

 

Then after grouping by session_id and wait_type, you can add Aggregation.   Click the Aggregation button:

 

[Screenshot: Aggregation button in the extended events toolbar]

 

Add SUM for both duration and signal_duration and sort the aggregation by duration (SUM) in descending order.

 

[Screenshot: Aggregation dialog with SUM of duration and signal_duration, sorted by duration (SUM) descending]

 

The resulting output shows the wait_types (with count of events in parentheses) and the aggregated sum of duration and signal duration – ordered by the highest wait_type duration in descending order:

 

[Screenshot: grouped and aggregated wait_type output]

 

 

Analyze the Captured Data

The following two test scenarios were conducted to help illustrate analyzing the collected data to determine any resource contention a REDO thread may be experiencing.

 

Scenario 1:  Blocked REDO due to large query on the secondary replica

This test was done to simulate a common scenario:  a blocked REDO thread causing the REDO queue to build up on the secondary.

We captured WAIT_INFO Xevent data for the entire time while the REDO thread was blocked and then afterward while it processed the large REDO queue that had built up. 

The testing configuration used to capture this data was the following:

  • On the primary, generate transactions to generate log blocks that will ship to the secondary replica.
  • Execute a large query on the secondary (simulating a report for example)
  • Create a blocked REDO condition by performing a DDL operation on the primary while the large query runs on the secondary.
  • Begin capturing the WAIT_INFO XEvent data on the secondary replica.
  • Maintain the REDO blocking condition on the secondary.
  • Continue generating transactions on the primary until the secondary REDO queue size reaches 1GB in size.
  • Stop the process on the primary so additional logs are not generated.
  • Kill the long running query on the secondary replica so that REDO can resume.
  • Continue collecting the extended event data until the REDO queue reaches 0.

The total extended event collection time was approximately 158.5 seconds.   The picture below shows the aggregated wait_type data during this test run.

 

[Screenshot: aggregated wait_type data for Scenario 1]

 

As can be seen from the descending aggregated data, the primary resource on which the REDO thread waited was LCK_M_SCH_M. This is the wait type a blocked REDO thread reports. In this case, we see that the REDO thread was blocked only once, for a total time of approximately 151.4 seconds.

Once the blocking situation was removed, it took approximately 7 more seconds for the 1GB REDO queue to be completely processed and caught up.   During these last 7 seconds there were only two significant resource types on which the REDO thread waited – REDO_SIGNAL  (~92%) and IO_COMPLETION (~8%).  This essentially shows that once the blocking situation was removed, there was no significant resource contention slowing down the REDO processing.

 

Scenario 2:  Blocked REDO followed by CPU contention on the secondary replica

This test was modified to demonstrate a scenario where the REDO thread is competing for CPU resources on the secondary replica – a condition that could exist when the secondary is used for heavy read-only processing (reporting for example).

Essentially, the test scenario began the same as the first scenario. The main difference is that, before the blocking condition was removed, we added two client sessions, each performing CPU-intensive queries on the same scheduler as the REDO thread, thus creating a situation where the REDO thread competes against other sessions for CPU resources.

We captured WAIT_INFO Xevent data for the entire time while the REDO thread was blocked and then afterward while it processed the large REDO queue that had built up. 

The testing configuration used to capture this data was the following:

  • On the primary, generate transactions to generate log blocks that will ship to the secondary replica.
  • Execute a large query on the secondary (simulating a report for example)
  • Create a blocked REDO condition by performing a DDL operation on the primary while the large query runs on the secondary.
  • Begin capturing the WAIT_INFO XEvent data on the secondary replica.
  • Maintain the REDO blocking condition on the secondary.
  • Continue generating transactions on the primary until the secondary REDO queue size reaches 1GB in size.
  • On the secondary, start two sessions on the same scheduler as the REDO thread.   Both of these sessions were CPU intensive – with no disk I/O.  (CPU utilization for this core went to 100% during these tests.)
  • Stop the process on the primary so additional logs are not generated.
  • Kill the long running query on the secondary replica so that REDO can resume.
  • Continue collecting the extended event data until the REDO queue reaches 0.

The total extended event collection time was approximately 266 seconds.   The picture below shows the aggregated wait_type data during this test run.

 

[Screenshot: aggregated wait_type data for Scenario 2]

 

As in the first scenario, the largest wait_type on which the REDO thread waited was LCK_M_SCH_M, waiting 1 time for 247.5 seconds. After the blocking situation is removed, however, we see a very different picture when compared to the first scenario. First, it took another 18.5 seconds to process the 1GB queue instead of the 7 seconds in the first scenario. Then, instead of predominantly waiting on REDO_SIGNAL, we see that 72% of the time (13.371/18.5) the REDO thread waited on SOS_SCHEDULER_YIELD.

There were 1668 times that the REDO thread voluntarily yielded the CPU to let other threads run, which is by design. But the duration and signal_duration values tell us that each time the REDO thread yielded, it had to wait again to get back onto the CPU. The two other sessions we introduced were contending for CPU time on the same scheduler. This is normal, but we are looking for excessive wait times. Every time the REDO thread yielded, it had to wait an average of 8ms before it got back onto the CPU (duration 13371 / 1668). While that doesn't seem like much, in aggregate it was over 72% of the time spent waiting on CPU.

Because the signal_duration was equal to the duration each time the REDO thread yielded, we can conclude that:

1) REDO was predominantly waiting on CPU.

2) It had to wait an average of 8ms before it got back onto the CPU (because of other running threads).

3) The entire time it was waiting to get back on the CPU (signal_duration = duration), it was on the runnable list waiting to execute, meaning it wasn't waiting on anything other than its turn back on the CPU.

4) The two other sessions were the primary factor in slowing down REDO processing during this test run.

 

 

Conclusions

Using the WAIT_INFO XEvent is a simple way to capture the various waits that a REDO thread is experiencing which can give important clues in troubleshooting slow REDO operations in AlwaysOn.

 

Sample scripts

The following two sample scripts are provided (see attached SampleScripts.zip file below) to help demonstrate using WAIT_INFO XEvents to see what REDO threads are waiting on. 

The first script, Collect_WaitInfo_For_REDO.sql, will dynamically determine which session_ids are currently performing REDO activities and create an extended event session that filters on those specific session_ids. It will start the session, wait one minute, and then terminate the session.

The second script, Shred_Wait_Info_Xevent.sql, takes the file collected by the first script and returns the top wait types by duration and by count for each session_id that was performing REDO activity. To execute the second script, use the CTRL-SHIFT-M option in an SSMS query window to set the required parameters. There are two parameters: the folder location where the XEL file is stored, and the name of the file (wildcards are accepted); see below.

 

[Screenshot: CTRL-SHIFT-M template parameters dialog]

 

Sample output:

 

[Screenshot: sample output from Shred_Wait_Info_Xevent.sql]
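If the attached scripts are not available, the same idea can be expressed directly in T-SQL. The following is only a minimal sketch, assuming the session above wrote files named redo_wait_info*.xel; it is not the attached script itself:

-- Shred the WAIT_INFO events and aggregate duration/signal_duration per wait type.
-- If needed, prefix the file pattern with the full path to the instance's \LOG folder.
;WITH raw_events AS
(
    SELECT CAST(event_data AS XML) AS ev
    FROM sys.fn_xe_file_target_read_file(N'redo_wait_info*.xel', NULL, NULL, NULL)
),
shredded AS
(
    SELECT ev.value('(event/action[@name="session_id"]/value)[1]', 'int')       AS session_id,
           ev.value('(event/data[@name="wait_type"]/text)[1]', 'nvarchar(60)')  AS wait_type,
           ev.value('(event/data[@name="duration"]/value)[1]', 'bigint')        AS duration_ms,
           ev.value('(event/data[@name="signal_duration"]/value)[1]', 'bigint') AS signal_duration_ms
    FROM raw_events
)
SELECT session_id,
       wait_type,
       COUNT(*)                AS wait_count,
       SUM(duration_ms)        AS total_duration_ms,
       SUM(signal_duration_ms) AS total_signal_duration_ms
FROM shredded
GROUP BY session_id, wait_type
ORDER BY total_duration_ms DESC;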

 

 

References

The following references provide more information on creating and using extended events.

How to add a TDE encrypted database to an Availability Group


By default, the Add Database Wizard and New Availability Group Wizard for AlwaysOn Availability Groups do not support databases that are already encrypted:  see Encrypted Databases with AlwaysOn Availability Groups (SQL Server).

If you have a database that is already encrypted, it can be added to an existing Availability Group – just not through the wizard.   This article provides the steps necessary to successfully add a TDE encrypted database to an AlwaysOn Availability Group.

This scenario has two instances: SQL1 (the availability group primary replica instance) and SQL2 (the secondary replica instance).

The following prerequisites are needed:

  1. An existing AlwaysOn Availability Group with at least one Primary and one Secondary replica instance.
  2. A TDE encrypted database on the same instance as the primary replica, online and accessible.
  3. A Database Master Key on all replica servers hosting the availability group (the primary will already have one since it has a TDE encrypted database).

 

The following actions will be done while adding the TDE encrypted database to the availability group.

  1. Verify each secondary replica instance has a Database Master Key (DMK) in the master DB (create a new one if missing)
  2. On the primary replica instance, create a backup of the certificate used to TDE encrypt the database.
  3. On each secondary replica instance, create the TDE Certificate from the certificate backed up on the primary.
  4. On the primary replica instance, create a full database backup of the TDE encrypted database.
  5. On the primary replica instance, create a transaction log backup of the TDE encrypted database.
  6. On the primary replica instance, add the TDE encrypted database to the Availability Group.
  7. On each secondary replica instance, restore the full backup (with no recovery).
  8. On each secondary replica instance, restore the transaction log backup (with no recovery).
  9. On each secondary replica instance, join the database to the availability group.

 

Step One:  Verify each secondary replica instance has a Database Master Key (DMK) in the master database – if not, create one.

To determine if an instance has a DMK, issue the following query:

USE master
GO
SELECT * FROM sys.symmetric_keys
WHERE name = '##MS_DatabaseMasterKey##'

If a record is returned, then a DMK exists and you do not need to create one, but if not, then one will need to be created. To create a DMK, issue the following TSQL on each replica instance that does not have a DMK already:

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Mhl(9Iy^4jn8hYx#e9%ThXWo*9k6o@';

Notes – PLEASE READ:

  • If you query the sys.symmetric_keys without a filter, you will notice there may also exist a “Service Master Key” named:   ##MS_ServiceMasterKey##.   The Service Master Key is the root of the SQL Server encryption hierarchy. It is generated automatically the first time it is needed to encrypt another key. By default, the Service Master Key is encrypted using the Windows data protection API and using the local machine key. The Service Master Key can only be opened by the Windows service account under which it was created or by a principal with access to both the service account name and its password.  For more information regarding the Service Master Key (SMK), please refer to the following article:  Service Master Key.  We will not need to concern ourselves with the SMK in this article.
  • If the DMK already exists and you do not know the password, that is okay as long as the service account that runs SQL Server has SA permissions and can open the key when it needs it (default behavior).   For more information refer to the reference articles at the end of this blog post.
  • You do not need to have the exact same database master key on each SQL instance.   In other words, you do not need to back up the DMK from the primary and restore it onto the secondary.   As long as each secondary has a DMK then that instance is prepared for the server certificate(s).
  • If your instances do not have DMKs and you are creating them, you do not need to have the same password on each instance.   The TSQL command, CREATE MASTER KEY, can be used on each instance independently with a separate password.   The same password can be used, but the key itself will still be different due to how our key generation is done.
  • The DMK itself is not used to encrypt databases – it is used simply to encrypt certificates and other keys in order to keep them protected.  Having different DMKs on each instance will not cause any encryption / decryption problems as a result of being different keys.
  • For more information regarding Transparent Data Encryption (TDE) & Database Master Keys (DMK) see:  Transparent Data Encryption (TDE)

 

Step Two:  On the primary replica instance, create a backup of the certificate used to TDE encrypt the database

To decrypt the TDE encrypted database on a secondary replica instance, that instance must have a copy of the certificate from the primary that is used to encrypt the database. It is possible there is more than one certificate installed on the primary replica instance. To know which certificate to back up, run the following query (on SQL1) and find the certificate name next to the database you wish to add to the availability group:

USE master
GO
SELECT db_name(database_id) [TDE Encrypted DB Name], c.name as CertName, encryptor_thumbprint
    FROM sys.dm_database_encryption_keys dek
    INNER JOIN sys.certificates c on dek.encryptor_thumbprint = c.thumbprint

 

It should give a result set similar to the following:

[Screenshot: query results showing the TDE encrypted database name, certificate name, and thumbprint]

 

Now backup the certificate using the TSQL command BACKUP CERTIFICATE (on SQL1):

USE master
GO
BACKUP CERTIFICATE [TDE_DB_EncryptionCert]
    TO FILE = 'TDE_DB_EncryptionCert'
    WITH PRIVATE KEY (FILE = 'TDE_DB_PrivateFile',
        ENCRYPTION BY PASSWORD = 't2OU4M01&iO0748q*m$4qpZi184WV487')

The BACKUP CERTIFICATE command will create two files.   The first file is the server certificate itself.   The second file is a “private key” file, protected by a password.  Both files and the password will be used to restore the certificate onto other instances.

When backing up the certificate, if no path is provided the certificate and private key files are saved to the default ‘data’ SQL Server database location defined for the instance.    For example, on the instance used in this example, the default data path for databases is “C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA”.

 

Note:

If the server certificate has been previously backed up and the password for the private key file is not known, there is no need to panic.   Simply create a new backup by issuing the BACKUP CERTIFICATE command and specify a new password.   The new password will work with the newly created files (the server certificate file and the private key file).

 

Step Three:  On each secondary replica instance, create the TDE Certificate from the certificate backed up on the primary

Step Two created a backup of the TDE certificate. This step will use that backup to "re-create" or "restore" the certificate on each of the secondary replica instances. The "backup" consists of two files: the server certificate (in this example, "TDE_DB_EncryptionCert") and the private key file (in this example, "TDE_DB_PrivateFile"), the second of which is protected by a password.

These two files along with the password should then be used with the TSQL command  CREATE CERTIFICATE to re-create the same server certificate on the other secondary replica instances.

After copying the files to SQL2, connect to a query window on SQL2 and issue the following TSQL command:

CREATE CERTIFICATE [TDE_DB_EncryptionCert]
    FROM FILE = '<path_where_copied>\TDE_DB_EncryptionCert'
    WITH PRIVATE KEY (FILE = '<path_where_copied>\TDE_DB_PrivateFile',
        DECRYPTION BY PASSWORD = 't2OU4M01&iO0748q*m$4qpZi184WV487')

This installs the server certificate on SQL2.

Repeat as necessary if there are more secondary replica instances.  This certificate must exist on all replicas (primary and all secondary replica instances).

 

Step Four:  On the primary replica instance (SQL1), create a full database backup of the TDE encrypted database

TSQL or the SSMS GUI can both be used to create a full backup.   Example:

USE master
GO
BACKUP DATABASE TDE_DB TO DISK = 'SOME path\TDEDB_full.bak';

For more information, please review:   How to:  Create a Full Database Backup (Transact-SQL) 

 

Step Five:  On the primary replica instance (SQL1), create a transaction log backup of the TDE encrypted database

TSQL or the SSMS GUI can both be used to create a transaction log backup.   Example:

USE master
GO
BACKUP LOG TDE_DB TO DISK = 'SOME path\TDEDB_log.trn';

For more information, please review:   How to:  Create a Transaction Log Backup (Transact-SQL) 

 

Step Six:  On the primary replica instance (SQL1), add the TDE encrypted database to the Availability Group

 

On the primary (SQL1 in this example) for the availability group (AG_Name), issue an ALTER AVAILABILITY GROUP command:

USE master
GO
ALTER AVAILABILITY GROUP [AG_Name] ADD DATABASE [TDE_DB]

This will add the encrypted database to the primary replica for the availability group called:  AG_Name.

 

Step Seven:  On each secondary replica instance, restore the full backup (from Step Four) with no recovery

TSQL or the SSMS GUI can both be used to restore a full backup.   Please be sure to specify “NO RECOVERY” so that the transaction log backup can also be restored:    Example:

USE master
GO
RESTORE DATABASE TDE_DB FROM DISK = 'SOME path\TDEDB_full.bak' WITH NORECOVERY;

For more information refer to the TSQL RESTORE command.

Repeat step seven for all secondary replica instances if there is more than one.

 

Step Eight:  On each secondary replica instance, restore the transaction log backup (from Step Five) with no recovery

TSQL or the SSMS GUI can both be used to restore a log backup.   Please be sure to specify “NO RECOVERY” so that the database remains in a “restoring” state and can be joined to the availability group.  Example:

USE master
GO
RESTORE LOG TDE_DB FROM DISK = 'SOME path\TDEDB_log.trn' WITH NORECOVERY;

For more information refer to the TSQL RESTORE command.

Repeat step eight for all secondary replica instances if there is more than one.

 

Step Nine:  On each secondary replica instance, join the database to the availability group 

On the secondary (SQL2), join the database to the availability group to begin synchronization by issuing the following TSQL statement: 

USE master
GO
ALTER DATABASE TDE_DB SET HADR AVAILABILITY GROUP = [AG_Name];

Repeat step nine on all secondary replica instances.
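As an optional check (not part of the original steps), you can confirm from the primary that the newly added database is synchronizing on every replica; a minimal sketch:

-- Run on the primary (SQL1); each replica should report the database as
-- SYNCHRONIZED or SYNCHRONIZING with a HEALTHY synchronization health.
SELECT ar.replica_server_name,
       adc.database_name,
       drs.synchronization_state_desc,
       drs.synchronization_health_desc
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON drs.replica_id = ar.replica_id
JOIN sys.availability_databases_cluster AS adc
    ON drs.group_database_id = adc.group_database_id
WHERE adc.database_name = N'TDE_DB';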

 

  

References


