The resiliency story in Lync today is great, but it is geared around providing voice resiliency only through the use of the Backup Registrar functionality and the deployment of prioritised DNS SRV records.
I see more and more organisations wanting to provide full resiliency for all Lync workloads (IM and Presence, Conferencing, the lot) as Lync becomes more of a business critical service. They want to ensure they can still provide IM and Presence in the event an entire site goes down. The supported and recommended Microsoft scenario to achieve this is the Metropolitan Site Resiliency Solution, which requires a significant amount of infrastructure to achieve full Lync resiliency in the event of site failure. It can be cost prohibitive and complex to deploy and manage, and the reality is that very few organisations are set up to support it.
I’ve come up with a model, architected for an organisation that cannot deploy full metro data centre resiliency (stretched VLAN, SAN replication etc) but still wants some level of site resiliency for all Lync functionality.
In this post, I’ll cover how you can provide site resiliency capabilities for not just voice but also IM and Presence, Conferencing and all other Lync workloads.
Introduction
I want to cover off a few things before I begin to set the scene. In this post, I’ll cover site level resiliency only, not data or server level resiliency. At a high level, you can achieve data level resiliency by deploying SQL clustering for your Lync databases, and server level resiliency by deploying an Enterprise Edition Front End pool, Mediation Server pool, etc. I’m not going to go into too much detail about achieving resiliency for voice either, because this is well documented on TechNet.
This design will allow you to make Lync completely available again in a 30-60 minute timeframe once the DR activation plan is initiated. This design also assumes that you have a primary data centre and a disaster recovery site only. The DR site should not support any production users during normal operations.
Lastly, I’m writing this post from the perspective of it being a Greenfield environment where no existing Lync Server 2010 infrastructure exists.
Determining your Site Resiliency Requirements
Firstly, you need to work out what Lync functionality you provide to your business that must be back up and running in your DR site. Is it IM and Presence? Is it Group Chat as well? Archiving? This will have an impact on what infrastructure you deploy in your DR site. It’s worth asking yourself/the business these kinds of questions:
- If you have an Archiving Server deployed in your production site, does it need to be online in your DR site before your users can start using Lync again (for compliance purposes)?
If so, you will need to have an Archiving Server in your DR site.
- Do you need to provide remote user access/federation in a DR scenario?
If so, you will need to deploy an Edge Server in your DR site.
- Do you need to provide access to Lync Web App and Lync Simple URLs in a DR scenario?
If so, you will need to have a Reverse Proxy deployed in your DR site.
Some businesses want everything back up and running, others are satisfied with just providing Lync to internal users when DR is activated. It depends on your requirements and how extensive your business continuity plans are.
Preparing the Disaster Recovery Site
The very first thing we need to do is prepare the DR site. Make sure it is defined in Topology Builder as another central site to keep the separation of infrastructure logical. You should define topology components in the DR site as per your DR requirements but at a minimum, one Lync Server 2010 Standard Edition Server needs to be deployed.
Now here’s where it gets interesting. We need to home the Lync CMS on this pool in DR. This is required so we can access the Lync Server Control Panel in the event the primary pool is down. During my testing, I found that if the CMS was hosted on the pool in the primary site, I obviously couldn’t make any topology changes or changes to CMS using LSCP, which we need to do to make Lync available to our users.
To deploy the CMS to the Standard Edition server in your DR site, log onto that server and run the Prepare first Standard Edition server step from the Lync Server 2010 Deployment Wizard. Once this is complete, you can publish your topology. For more information on setting up the Standard Edition server, check out this TechNet Library article.
What if my CMS is already deployed elsewhere?
If you are implementing this DR design into an existing Lync Server 2010 environment and your CMS is already deployed on the first Lync Front End pool (Standard or Enterprise Edition) you deployed, that’s ok. To move it to your DR server, just follow Tom Pacyk’s great post on his blog on how to move the Lync CMS.
Preparing the Disaster Recovery Site for Group Chat
In order to provide site resiliency for Group Chat, we need to have a standby SQL server and a Windows server ready to go in DR. A lot of the principles from my post on database mirroring and Group Chat apply here, in that you shouldn’t actually have a Group Chat server set up in DR. Instead, you will need a standby server ready to have Group Chat installed on it in the event of site failure.
Deploying and Configuring the Primary Site
Once we have our DR site sorted out with the CMS deployed to a Standard Edition server in it, it’s time to deploy our Lync Server 2010 infrastructure into our primary site.
At this stage, you can deploy your production pool, associated servers (Edge, Mediation, Archiving, Monitoring, etc) and provision your users. This step goes ahead as per any other Lync deployment.
Backing up your data
Now that we have a fully functioning Lync deployment and a DR site in place, it’s time to take the necessary steps to ensure we can restore services in the DR site in the event of site failure.
Given that we have the CMS in the DR site, the only thing we really have to worry about here is backing up the contact lists of your users daily using dbimpexp.exe from the production Lync pool. You should move these XML files across to your DR site for safe keeping and so they are easily accessible.
If you’re using Response Groups, you should back these up using the instructions at the bottom of this TechNet article.
If you need to provide site level resiliency for Group Chat, you will need to keep a regular (at least daily) backup of the GroupChat SQL database.
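To illustrate the "move the XML files across to DR" part, here is a minimal Python sketch, not something from my lab build. It assumes your nightly dbimpexp.exe job already drops its exports into a local folder and that the DR site exposes a folder or share you can write to. The function name, the folder arguments and the 14 day retention are all hypothetical, so adjust them to suit.

```python
import shutil
import time
from pathlib import Path


def ship_exports_to_dr(export_dir, dr_dir, keep_days=14, now=None):
    """Copy contact list XML exports to the DR folder, then prune old copies.

    Returns (copied, pruned) as two lists of file names.
    All names and the retention period are illustrative assumptions.
    """
    export_dir, dr_dir = Path(export_dir), Path(dr_dir)
    dr_dir.mkdir(parents=True, exist_ok=True)
    now = time.time() if now is None else now

    copied = []
    for src in sorted(export_dir.glob("*.xml")):
        dest = dr_dir / src.name
        # Skip exports that already exist in DR with the same size.
        if dest.exists() and dest.stat().st_size == src.stat().st_size:
            continue
        shutil.copy2(src, dest)  # copy2 keeps the original modification time
        copied.append(src.name)

    pruned = []
    cutoff = now - keep_days * 86400
    for old in sorted(dr_dir.glob("*.xml")):
        # Remove DR copies whose modification time is older than the cutoff.
        if old.stat().st_mtime < cutoff:
            old.unlink()
            pruned.append(old.name)
    return copied, pruned
```

You could run something like this as a scheduled task straight after the export finishes. It only handles the file copy and clean-up; the export itself is still done by dbimpexp.exe.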
Activating your Lync Site Resiliency Plan
So we’ve set up all the infrastructure, we’re backing up the data and we’re prepared for the worst. When the time comes (say your WAN link to the primary data centre fails or it all goes up in flames), you’ll need to follow these steps to restore service for Lync in your DR site:
- Firstly, open the Lync Server Control Panel (if your CMS is on the pool that failed, you won’t be able to open it, hence homing it on the DR pool).
- Navigate to the Users tab and click Find to search for all users. Alternatively, you can filter the search to only find users registered against the pool that is in the site that failed.
- Click the Action button, then select Move all users to pool.. from the drop down menu.
- When the Move Users dialog appears, select the name of the pool that has failed under Source registrar pool.
- Under the Destination registrar pool drop down menu, select the name of the Standard Edition server in your DR site.
- Now click the check box next to Force (we have to click Force because the source pool is down) and then click Move.
Selecting Force means the task will not attempt to reach out to the source registrar pool and migrate the users’ contact lists. Rather, it will just update the msRTCSIP-PrimaryHomeServer attribute for the users. As a result, all users will have lost their contact lists (as the source pool is offline).
- Using dbimpexp.exe, import the contact lists you have been backing up each night onto the DR pool where the users are now homed. See my previous post on how to do this.
- If you’re using Response Groups, restore them using the procedure outlined in this TechNet article.
- Lastly, update DNS records so that the sip.domain.com A record points to the IP address of your DR Standard Edition server, or update your SRV records to point to the DR Standard Edition server.
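Once the DNS change is made, it can take a while to show up for clients because of record TTLs and caching. The following is a minimal Python sketch, assuming an IPv4 A record, for confirming that sip.domain.com now resolves to the DR server. The function names and the example IP address are hypothetical; the resolver and sleep arguments are only there so the check can be exercised without touching real DNS.

```python
import socket
import time


def resolves_to(hostname, expected_ip, resolver=socket.gethostbyname_ex):
    """Return True if hostname resolves to expected_ip (IPv4 only)."""
    try:
        _name, _aliases, addresses = resolver(hostname)
    except OSError:
        # socket.gaierror is an OSError: a failed lookup counts as "not cut over".
        return False
    return expected_ip in addresses


def wait_for_cutover(hostname, expected_ip, attempts=10, delay=30.0,
                     resolver=socket.gethostbyname_ex, sleep=time.sleep):
    """Poll until hostname resolves to expected_ip.

    Returns the attempt number that succeeded, or None if it never did.
    """
    for attempt in range(1, attempts + 1):
        if resolves_to(hostname, expected_ip, resolver):
            return attempt
        if attempt < attempts:
            sleep(delay)  # wait before the next lookup
    return None


if __name__ == "__main__":
    # 10.20.0.10 is a made-up address standing in for the DR Standard Edition server.
    print(wait_for_cutover("sip.domain.com", "10.20.0.10", attempts=3, delay=5.0))
```

Bear in mind this only tells you what the machine running the script resolves. Clients with a cached record may keep trying the old address until their cache expires.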
What about Group Chat?
I haven’t forgotten about our old mate GC. 🙂 I’ve covered providing DR for Group Chat in some detail before, so a lot of the steps from that post can be applied to this scenario.
I was reluctant to double up on content in this post so have chosen to omit specific steps, but if you’d like me to detail providing site resiliency for Group Chat in a separate post, let me know in the comments.
Expected Behaviour
When your users sign in, they will have full instant messaging, presence and conferencing abilities (audio, video, IM, desktop sharing, etc). Scheduled conferences will take a hit (as these will be tied to the failed pool), but new ad-hoc conferences will be able to be created.
Provided you have deployed the DR voice infrastructure (Mediation Servers, gateways, backup routes etc), your users will also be able to make and receive PSTN calls.
Conclusion
I’ve tested this scenario in a lab environment and can verify that it does provide a full restoration of Lync features. It’s not an automatic failover, but it does provide a level of site resiliency that allows you to get Lync back up and running for your users in a second site without shelling out for the infrastructure required by the Metropolitan Site Resiliency design.
Give this a go in your lab, let me know how it goes for you. If you’ve got some questions, want me to go into more detail in a particular area, let me know in the comments. If you decide to go ahead with it, I hope it helps you provide a better level of service to your organisation.
Hey Justin,
Thanks for the hint about hosting the CMS at the DR site. I’m actually building exactly the same design for a customer right now.
What about Edge server site resiliency? Do I need a separate Edge pool for the DR site/pool or can my main Edge pool service the DR pool?
Hi Wolfgang,
You definitely need a separate Edge pool in the DR site if you want to provide complete site resilience. You should have the DR Standard Edition server set as this Edge server’s next hop in your Lync topology.
Hi Justin,
Thanks a lot for a GREAT article!
Like Wolfgang, this is exactly what I am looking for. I should admit this is not easy.
I am just wondering if I can use SQL server with log shipping on DR site instead of using dbimpexp.exe?
Are there any suggestions/pitfalls on fail back steps?
Hi Ilkin,
You can’t use Log Shipping with SQL because this is pretty much similar to SQL Database Mirroring. It’s an unsupported SQL HA method for Lync Server 2010.
With regards to fail back, give me a while and I’ll write a followup post to this one. 🙂
Hello again Justin,
You mentioned “…CMS deployed to a Standard Edition server in it, it’s time to deploy our Lync Server 2010 infrastructure into our primary site.”.
Doesn’t this mean the DR site Lync Standard Edition server holding the CMS should always be up and running? What if the DR site collapses?
That’s right, your DR site will need to always be online because it has the master copy of the CMS. If your DR site fails, Lync will still keep functioning. You won’t be able to make any changes to configuration until the CMS is back up and running though.
Justin, thanks for the reply. Taking into account that the whole back end will be moved to and hosted on the DR site’s SQL Express in a DR scenario, how many users do you think it can hold? What would be the maximum number of users? Also, how important is failing over the Archiving server to the DR site? Would you recommend it, or should it be left so the Standard Edition server keeps queuing messages in MSMQ?
Hi Justin,
Just to confirm, with your design can Lync 2010 Enterprise be used as opposed to Standard? Does this change how the CMS is stored, other than being on a separate SQL server, or have any effect on the overall solution?
Cheers Matt.
Hi Matt,
No this doesn’t change how the CMS is stored. You can use either Standard or Enterprise Edition in your DR site for this solution.
Thanks for good question Matt and for handy answer Justin.
Can you please answer my archiving related question Justin?
Hi Guys,
I have just completed a topology that uses resiliency without the requirement for hardware load balancers. It utilises a SQL store at each site for user data, while the CMS is at the primary site. In the event of a failure we would usually be happy to wait for the primary site to come back before making a config change in Lync, so I will keep this article as plan B in case the primary site won’t be returning.
Other than that, using prioritised SRV locations in DNS and a Director pool, I am hoping it will be fairly automated. It is all using Enterprise Edition. Can anybody see any holes in my theory?
At a high level, it sounds fine to me Bobby. How are you handling failover? Prioritised SRV records only work well for backup registrars (which only provides voice resiliency).
I believe the Lync client will first pick up the priority 0 record, which is the Lync Director. If that’s down, then lyncpool 1; if that’s down, then lyncpool 2; if that’s down, then I hope it will look for DHCP option 120.
We rely on our cisco UC and CUCIMOC addin for all the voice, Lync is only handling our IM, Desktop Sharing and IM conferencing. Polycoms do everything else for us .
Just be aware, the Lync client will happily redirect and sign into another pool that is listed with a lower priority SRV record, but unless the user has been moved to that pool, they will sign in in Limited Functionality Mode (no contact list).
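To make that concrete, here is a deliberately simplified Python model of the behaviour described above. It is not the Lync client’s actual discovery logic: it assumes the SRV records point straight at registrar pools (no Director in front), ignores SRV weights, and simply tries the lowest priority number first. The pool names are hypothetical.

```python
from typing import NamedTuple


class SrvRecord(NamedTuple):
    priority: int  # lower number is tried first
    target: str


def first_reachable(records, reachable):
    """Return the reachable target with the lowest priority number, or None."""
    for record in sorted(records, key=lambda r: r.priority):
        if record.target in reachable:
            return record.target
    return None


def sign_in_mode(home_pool, records, reachable):
    """Return (pool, mode) for a user homed on home_pool.

    Simplified rule: full functionality only when the pool the client lands on
    is the user's home pool; otherwise limited functionality (no contact list).
    """
    target = first_reachable(records, reachable)
    if target is None:
        return None, "no sign-in"
    if target == home_pool:
        return target, "full"
    return target, "limited functionality"


if __name__ == "__main__":
    # Hypothetical pool names, purely for illustration.
    records = [SrvRecord(0, "pool1.domain.com"), SrvRecord(10, "pool2.domain.com")]
    # Primary pool down, user still homed on it: limited functionality.
    print(sign_in_mode("pool1.domain.com", records, {"pool2.domain.com"}))
    # Same outage after the user has been force moved to the DR pool: full.
    print(sign_in_mode("pool2.domain.com", records, {"pool2.domain.com"}))
```

The second call is the reason the forced move matters: the SRV records get the client to a working pool, but only moving the user makes that pool their home.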
good to know, I think that is within acceptance of our business continuity plan.
thanks for your help Justin !
Well, it’s all configured and it works a treat.
You were correct about limited functionality, to get full functionality for a user while their home server is down I just forced a move to the server still running. Then it failed them over without issue.
So I have achieved resiliency without using any hardware load balancing and in total only using 3 servers, a Front End and Director at one site, then a standby front end at the other.
Pretty happy with the outcome.
Good to hear you’ve got it sorted Bobby. 🙂
Hi Morris,
Thanks for the interesting article.
Just want to know: as Bobby has achieved site resiliency, do we need a Director with prioritised SRV records?
You don’t need a Director pool in most scenarios, only if you need an extra layer of security for connections coming in from the Edge.
Hi Justin,
I like the method you have used here but would like to get a feel for downsides if you lose prolonged WAN connectivity to the DR CMS while still running out of the production site. I’m assuming peer to peer services will still function etc without me doing any testing.
Hi Nick,
If the CMS is unavailable, you essentially lose the ability to conduct moves, adds and/or changes in the environment. Nothing in the Control Panel, LSMS (policy, configuration or user related) or Topology Builder (adding/removing services) will be changeable.
It won’t impact BAU usage of the environment. Users will still be able to do everything they normally would because the pool has a replica of the CMS. I haven’t conducted a comprehensive test of what does and does not continue working, but that’s my initial indication.
Hi Justin,
Thank you so much for great article.
I already have Lync Primary site with Front end, Edge and Reverse proxy. I want to setup Lync DR in another site with the same functionality. Can I use the same certificate SIP.domain.com in another site? Do you have fail back steps?
Thanks,
Nitin
Hi Nitin,
You can’t really use the same certificate in the other site if you want that site to be running at the same time as the primary site. To use the same certificate you would need to replicate all other settings to that site e.g. IP addresses and FQDNs. This is not recommended in my view, it’s better to have two separate sites running simultaneously and fail over when necessary.
I don’t have any fail back steps at this stage.
Hi Justin
Thank you for your post. It is well described to overcome extra effort in creating Metro Site Resilience solution
I have one doubt on this.
I am taking daily backup of my CMS database .So can’t we use like below
1) Keep the CMS DB at Primary site itself
2) Create a DR site also.
3) Once Primary goes down, Restore CMS from the backup to the DR site
4) Then continue with the procedure as discussed
-Sachin
Hi Sachin,
That may work as well if you wanted to take that route. My preference in Lync Server 2010 is to have the CMS up and running already in a DR scenario.
Hi Justin,
We have deployed LYNC 2010 in our existing environment as per below configuration.
Two Node A/P SQL 2008 Cluster.
FE pool with two Frontend Server.
Edge pool with Two Edge server.
Now we are planning to implement Metropolitan Site Resiliency in our environment.
So we need to migrate our existing Lync environment to the new one. Could you please help with how to proceed and what the challenges are?
Great and very informative article