The resiliency story in Lync today is great, but it is geared towards voice resiliency only, through the Backup Registrar functionality and prioritised DNS SRV records.
I see more and more organisations wanting to provide full resiliency for all Lync workloads (IM and Presence, Conferencing, the lot) as Lync becomes more of a business critical service. They want to ensure they can still provide IM and Presence in the event an entire site goes down. The supported and recommended Microsoft scenario to achieve this is the Metropolitan Site Resiliency Solution, which requires a significant amount of infrastructure to achieve full Lync resiliency in the event of site failure. It can be cost prohibitive and complex to deploy and manage, and the reality is that very few organisations are set up to support this.
I’ve come up with a model, architected for an organisation that cannot deploy full metro data centre resiliency (stretched VLAN, SAN replication, etc.) but still wants some level of site resiliency for all Lync functionality.
In this post, I’ll cover how you can provide site resiliency capabilities for not just voice but also IM and Presence, Conferencing and all other Lync workloads.
Introduction
I want to cover off a few things before I begin to set the scene. In this post, I’ll cover site level resiliency only, not data or server level resiliency. At a high level, you can achieve data level resiliency by deploying SQL clustering for your Lync databases, and server level resiliency by deploying an Enterprise Edition Front End pool, Mediation Server pool, etc. I’m not going to go into too much detail about achieving resiliency for voice either, because this is well documented on TechNet.
This design will allow you to make Lync completely available again within 30-60 minutes of the DR activation plan being initiated. This design also assumes that you have a primary data centre and a disaster recovery site only. The DR site should not support any production users during normal operations.
Lastly, I’m writing this post from the perspective of it being a Greenfield environment where no existing Lync Server 2010 infrastructure exists.
Determining your Site Resiliency Requirements
Firstly, you need to work out what Lync functionality you provide to your business that must be back up and running in your DR site. Is it IM and Presence? Is it Group Chat as well? Archiving? This will have an impact on what infrastructure you deploy in your DR site. It’s worth asking yourself/the business these kinds of questions:
- If you have an Archiving Server deployed in your production site, does it need to be online in your DR site before your users can start using Lync again (for compliance purposes)?
If so, you will need to have an Archiving Server in your DR site.
- Do you need to provide remote user access/federation in a DR scenario?
If so, you will need to deploy an Edge Server in your DR site.
- Do you need to provide access to Lync Web App and Lync Simple URLs in a DR scenario?
If so, you will need to have a Reverse Proxy deployed in your DR site.
Some businesses want everything back up and running, others are satisfied with just providing Lync to internal users when DR is activated. It depends on your requirements and how extensive your business continuity plans are.
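To make the questions above concrete, here is a minimal sketch in Python that turns the answers into a list of roles to plan for in the DR site. The class, field and function names are hypothetical and purely for illustration. The Standard Edition server is always included because this design needs at least one in DR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DrRequirements:
    """Answers to the questions above (field names are illustrative only)."""
    archiving_required_for_compliance: bool = False
    remote_access_or_federation: bool = False
    web_app_or_simple_urls: bool = False


def dr_site_components(req: DrRequirements) -> List[str]:
    """Return the roles to plan for in the DR site."""
    # One Standard Edition server is the minimum for this design.
    components = ["Standard Edition Server"]
    if req.archiving_required_for_compliance:
        components.append("Archiving Server")
    if req.remote_access_or_federation:
        components.append("Edge Server")
    if req.web_app_or_simple_urls:
        components.append("Reverse Proxy")
    return components


if __name__ == "__main__":
    # Example: internal users plus remote access, nothing else.
    print(dr_site_components(DrRequirements(remote_access_or_federation=True)))
```

Writing the answers down in this form is only a planning aid. It does not replace defining the components in Topology Builder.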
Preparing the Disaster Recovery Site
The very first thing we need to do is prepare the DR site. Make sure it is defined in Topology Builder as another central site to keep the separation of infrastructure logical. You should define topology components in the DR site as per your DR requirements but at a minimum, one Lync Server 2010 Standard Edition Server needs to be deployed.
Now here’s where it gets interesting. We need to home the Lync Central Management Store (CMS) on this pool in DR. This is required so we can access the Lync Server Control Panel (LSCP) in the event the primary pool is down. During my testing, I found that if the CMS was hosted on the pool in the primary site and that pool was down, I couldn’t make any topology changes or changes to the CMS using LSCP, which we need to do to make Lync available to our users.
To deploy the CMS to the Standard Edition server in your DR site, log onto the Standard Edition server in the DR site and run the Prepare first Standard Edition server step from the Lync Server 2010 Deployment Wizard. Once this is complete, you can publish your topology. For more information on setting up the Standard Edition server, check out this TechNet Library article.
What if my CMS is already deployed elsewhere?
If you are implementing this DR design into an existing Lync Server 2010 environment and your CMS is already deployed on the first Lync Front End pool (Standard or Enterprise Edition) you deployed, that’s ok. To move it to your DR server, just follow Tom Pacyk’s great post on his blog on how to move the Lync CMS.
Preparing the Disaster Recovery Site for Group Chat
In order to provide site resiliency for Group Chat, we need to have a standby SQL server and a Windows server ready to go in DR. A lot of the principles from my post on database mirroring and Group Chat will apply here, in that you shouldn’t actually have a Group Chat server set up in DR. Instead, you will need a standby server ready to have Group Chat installed on it in the event of site failure.
Deploying and Configuring the Primary Site
Once we have our DR site sorted out with the CMS deployed to a Standard Edition server in it, it’s time to deploy our Lync Server 2010 infrastructure into our primary site.
At this stage, you can deploy your production pool, associated servers (Edge, Mediation, Archiving, Monitoring, etc) and provision your users. This step goes ahead as per any other Lync deployment.
Backing up your data
Now that we have a fully functioning Lync deployment and a DR site in place, it’s time to take the necessary steps to ensure we can restore services in the DR site in the event of site failure.
Given that we have the CMS in the DR site, the only thing we really have to worry about here is backing up the contact lists of your users daily using dbimpexp.exe from the production Lync pool. You should move these XML files across to your DR site for safe keeping and so they are easily accessible.
If you’re using Response Groups, you should back these up using the instructions at the bottom of this TechNet article.
If you need to provide site level resiliency for Group Chat, you will need to keep a regular (at least daily) backup of the GroupChat SQL database.
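The nightly contact list export could be scripted along these lines. This is a minimal sketch in Python, not a tested production job. The function names, the dated file name and the example paths are all hypothetical. The /hrxmlfile and /sqlserver switches are the ones I understand dbimpexp.exe to use for an export, so check them against the tool’s own documentation before relying on this.

```python
import datetime
import shutil
from pathlib import PureWindowsPath
from typing import List, Optional


def build_export_command(export_dir: str, day: datetime.date,
                         sql_server: Optional[str] = None) -> List[str]:
    """Build a dbimpexp.exe command line for one day's contact list export."""
    # A dated file name keeps each night's export separate (naming is an assumption).
    xml_path = PureWindowsPath(export_dir) / f"lync-contacts-{day.isoformat()}.xml"
    command = ["dbimpexp.exe", f"/hrxmlfile:{xml_path}"]
    if sql_server:
        # Assumption: an Enterprise Edition pool needs its back-end SQL Server named.
        command.append(f"/sqlserver:{sql_server}")
    return command


def copy_to_dr(xml_file: str, dr_share: str) -> str:
    """Copy an exported XML file to a share in the DR site; returns the new path."""
    return shutil.copy2(xml_file, dr_share)


if __name__ == "__main__":
    # Print the command rather than running it; paths here are examples only.
    print(build_export_command(r"C:\LyncBackup", datetime.date.today()))
```

On the production Front End, the returned list could be passed to subprocess.run(..., check=True) from a scheduled task, followed by copy_to_dr to move the file to the DR site.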
Activating your Lync Site Resiliency Plan
So we’ve set up all the infrastructure, we’re backing up the data and we’re prepared for the worst. When the time comes (say your WAN link to the primary data centre fails or it all goes up in flames), you’ll need to follow these steps to restore service for Lync in your DR site:
- Firstly, open the Lync Server Control Panel (if your CMS is on the pool that failed, you won’t be able to open it, hence homing it on the DR pool).
- Navigate to the Users tab and click Find to search for all users. Alternatively, you can filter the search to only find users registered against the pool that is in the site that failed.
- Click the Action button, then select Move all users to pool.. from the drop down menu.
- When the Move Users dialog appears, select the name of the pool that has failed under Source registrar pool.
- Under the Destination registrar pool drop down menu, select the name of the Standard Edition server in your DR site.
- Now click the check box next to Force (we have to click Force because the source pool is down) and then click Move.
Selecting Force means the task will not attempt to reach out to the source registrar pool and migrate the users’ contact lists. Rather, it will just update the msRTCSIP-PrimaryHomeServer attribute for the users. As a result, all users will lose their contact lists (as the source pool is offline).
- Using dbimpexp.exe, import the contact lists you have been backing up each night onto the DR pool where the users are now homed. See my previous post on how to do this.
- If you’re using Response Groups, restore them using the procedure outlined in this TechNet article.
- Lastly, update DNS records so that the sip.domain.com A record points to the IP address of your DR Standard Edition server, or update your SRV records to point to the DR Standard Edition server.
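Once the DNS change in the last step has been made, it’s worth confirming that clients will actually land on the DR server. Below is a minimal sketch in Python that checks whether a set of A records resolves to the DR IP address. The function name, host name and IP address are hypothetical. It only checks A records, because the Python standard library has no SRV lookup, so SRV records would still need to be checked with nslookup or similar.

```python
import socket
from typing import Callable, Dict, List


def check_dns_cutover(hostnames: List[str], dr_ip: str,
                      resolve: Callable[[str], str] = socket.gethostbyname
                      ) -> Dict[str, bool]:
    """Map each host name to True if it currently resolves to the DR IP."""
    results = {}
    for name in hostnames:
        try:
            results[name] = resolve(name) == dr_ip
        except OSError:
            # A failed lookup is treated as "not cut over yet".
            results[name] = False
    return results


if __name__ == "__main__":
    # Example values only; substitute your own SIP domain and DR server IP.
    print(check_dns_cutover(["sip.domain.com"], "10.2.0.10"))
```

The resolve parameter is there so the check can be exercised with a fake resolver in a lab before it is pointed at real DNS.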
What about Group Chat?
I haven’t forgotten about our old mate GC. 🙂 I’ve covered providing DR for Group Chat in some detail before, so a lot of the steps from that post can be applied to this scenario.
I was reluctant to double up on content in this post so have chosen to omit specific steps, but if you’d like me to detail providing site resiliency for Group Chat in a separate post, let me know in the comments.
Expected Behaviour
When your users sign in, they will have full instant messaging, presence and conferencing abilities (audio, video, IM, desktop sharing, etc). Scheduled conferences will take a hit (as these will be tied to the failed pool), but new ad-hoc conferences can still be created.
Provided you have deployed the DR voice infrastructure (Mediation Servers, gateways, backup routes, etc), your users will also be able to make and receive PSTN calls.
Conclusion
I’ve tested this scenario in a lab environment and can verify that it does provide a full restoration of Lync features. It’s not an automatic failover, but it does provide a level of site resiliency that allows you to get Lync back up and running for your users in a second site without shelling out for the infrastructure required by the Metropolitan Site Resiliency design.
Give this a go in your lab and let me know how it goes for you. If you’ve got some questions, or want me to go into more detail in a particular area, let me know in the comments. If you decide to go ahead with it, I hope it helps you provide a better level of service to your organisation.