Friday, March 30, 2012

VMware SRM vSphere replication VRMS issues

I have started to do some testing with the new vSphere replication technology that is now part of VMware Site Recovery Manager. I began working on this in our lab, and encountered what I believe is a bug that many people will run into in their production environments.

In the new version of SRM there is a new feature called vSphere replication, which gives the user the ability to replicate VMs on a per-VM basis. This is very beneficial if you do not have similar storage arrays or array-based replication software at either end. It can also come in handy if you do not want to do per-volume replication and fail over all the VMs on a volume.

vSphere replication is done using a vSphere Replication Management Server (VRMS) at each site, which communicates with vCenter. At each site you will also need several vSphere Replication Servers (VRs) to facilitate the actual replication. The problems I have encountered occur in the communication between the VRMS and vCenter.

Problem number one - The first issue I encountered was related to the fact that the certificate generated by my vCenter installation a few years ago had expired (by default the installer certificate is only valid for two years). I looked up how to attack this problem in VMware's KB and came across the following article. I performed the tasks in the article, generated a nice new self-signed certificate, and all was good.

When I began deployment of my VRMS, the OVF file uploaded perfectly and the OS started just fine. I logged into the VRMS, configured a database connection, and told it where my vCenter was. When I selected "save and restart service" it told me there was something wrong with the certificate, which is pretty normal for a self-signed certificate, so I accepted it. I went back to the vSphere Client, clicked on the SRM button, and selected the vSphere replication section. Then I clicked "Configure VRMS Connection" on the right side, which popped up this message;

I did not read the certificate error message closely enough. After things didn't work I had to start investigating in the logs to find out what was wrong. Below is a link to the VMware KB article describing the issue.

Disclaimer: The following procedure is likely not supported by VMware and is a documented process of my own experience to fix an issue that I have personally encountered. Please use this information at your own risk and be careful doing so in production environments.

The documentation on how these VRMSs work is a little lacking; I'm sure it's because the technology is brand new. I had to log in to the Linux console of the appliance and do some poking around. Open a console and log in as root with the <password> that you specified during the OVF deployment. To save you some work, run this command after logging into the appliance:

less /opt/vmware/logs/hms/hms.log

Hit the End key and it'll take you to the end of the log, and you can then use the up arrow to scroll back through it.
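Rather than scrolling, the log can also be filtered for certificate-related lines. This is a minimal sketch: only the log path comes from the step above, while the function name `show_cert_errors` and the search terms (`md5`, `certificate`) are my own assumptions about what is worth looking for.

```shell
# Show the most recent log lines that mention certificates or MD5.
# The search terms are an assumption; adjust them to what your log contains.
show_cert_errors() {
  log_file="${1:-/opt/vmware/logs/hms/hms.log}"
  max_lines="${2:-20}"
  # -i: case-insensitive, -n: prefix each match with its line number
  grep -inE 'md5|certificate' "$log_file" | tail -n "$max_lines"
}
```

Running `show_cert_errors /opt/vmware/logs/hms/hms.log 50` on the appliance would print up to the last 50 matching lines, each prefixed with its line number so you can jump to it in less.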

We noticed that there were a variety of "MD5 hash" errors in the logs. If you go through the release notes of SRM you will find that SRM does not support a vCenter certificate signed with the MD5 hash algorithm; it has to be SHA1 with RSA. In the KB article I linked above on generating a new self-signed certificate, guess what hashing algorithm the commands they give you use for your new certificate .......yup..... MD5. Below is the excerpt:

openssl.exe req -new -x509 -days 3650 -md5 -nodes -key rui.key -out rui.crt -subj "fqdn_of_VC" 

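What you need to do is go back and generate a new SSL certificate with the command below, which swaps -md5 for -sha1. Note that openssl expects the subject in "/CN=..." form, so replace fqdn_of_VC with your vCenter's FQDN:

openssl.exe req -new -x509 -days 3650 -sha1 -nodes -key rui.key -out rui.crt -subj "/CN=fqdn_of_VC"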
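Before re-registering, it is worth confirming which hash the certificate actually uses. The sketch below assumes a Unix-like shell with openssl available (for example after copying rui.crt off the vCenter server); the function names are hypothetical, and the parsing relies on the "Signature Algorithm:" line that `openssl x509 -noout -text` prints.

```shell
# Read `openssl x509 -noout -text` output on stdin and print the first
# "Signature Algorithm" value (for example sha1WithRSAEncryption).
sig_alg_from_text() {
  awk -F': ' '/Signature Algorithm/ { print $2; exit }'
}

# Print the signature algorithm of a PEM certificate file.
cert_sig_alg() {
  openssl x509 -in "$1" -noout -text | sig_alg_from_text
}
```

`cert_sig_alg rui.crt` should report an md5 variant for a certificate made with the KB command and a sha1 variant for one made with the corrected command. On the Windows vCenter server itself, `openssl.exe x509 -in rui.crt -noout -text` prints the same field and can simply be read by eye.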

Once this has been completed, log in to the web interface of the VRMS, select the config tab, and unregister from vCenter. Then register with vCenter again. This will pop up the new certificate; select accept.

That takes care of one problem. On to problem two.

I went back into the SRM console thinking I had resolved the issue and tried the "Configure VRMS Connection" link again: same error message. I went back into the same log and discovered the different error messages below.

So if I interpreted these errors correctly the VRMS is having trouble validating the certificate chain of the vCenter, but why? I don't have the box checked that all certificates have to be trusted by a CA.

We tried a bunch of different things here, from generating new certificates using new keys to changing types of certificates; nothing worked. We took a break and I started digging around to see if I could find where the certificates were being checked. These looked like Java error messages to me, so we started looking for Java KeyStore (JKS) files. It just so happens that if you look in the following path

/opt/vmware/hms/security

and do an ls, you should see some files, including hms-truststore.jks.


Apparently the hms-truststore.jks is where the certificates are stored that the VRMS appliance trusts. Basically what you need to do at this point is add the vCenter certificate to this keystore.

1. Download and install an SCP client like WinSCP. This will allow you to copy the rui.crt file from the vCenter server to the appliance.

2. Modify the appliance to allow remote root login. Open /etc/ssh/sshd_config in vi and edit the PermitRootLogin line to say yes.

3. Restart the SSH service by running this command

service sshd restart

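4. Connect to the appliance by IP with the WinSCP client and upload the rui.crt file from each of your vCenter servers (production and DR) into the /opt/vmware/hms/security folder. On each vCenter server the file can be found in its default location of "C:\ProgramData\VMware\VMware VirtualCenter\SSL". Give the two copies distinct names so they do not overwrite each other; the commands below assume hq-rui.crt and dr-rui.crt.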

5. Import the certificate into the hms-truststore.jks using the below commands

cd /opt/vmware/hms/security
keytool -import -trustcacerts -alias hq -file hq-rui.crt -keystore hms-truststore.jks
keytool -import -trustcacerts -alias dr -file dr-rui.crt -keystore hms-truststore.jks

It will ask for the keystore password, which is "vmware" (no quotes).

Then it will ask if you're sure you want to trust the certificate; type "yes" (no quotes). You should see something like this

6. Log back into the web interface of your VRMS, select the config tab, and unregister, then re-register your VRMS with vCenter.
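Steps 2 and 5 above can be sketched as small shell helpers. This is only an illustration under assumptions: the function names are hypothetical, the sed expression assumes sshd_config already contains a (possibly commented-out) PermitRootLogin line, and the keytool check simply lists an alias after the import.

```shell
# Step 2: set PermitRootLogin to yes, keeping a backup of the original file.
enable_root_login() {
  cfg="${1:-/etc/ssh/sshd_config}"
  cp -p "$cfg" "$cfg.bak"
  # Rewrite any existing (possibly commented-out) PermitRootLogin line.
  # If no such line exists, the file is left unchanged.
  sed 's/^#*[[:space:]]*PermitRootLogin.*/PermitRootLogin yes/' "$cfg.bak" > "$cfg"
}

# Step 5 check: list one alias from the truststore to confirm the import.
# "vmware" is the keystore password given above.
list_trusted_cert() {
  keytool -list -alias "$1" \
    -keystore "${2:-/opt/vmware/hms/security/hms-truststore.jks}" \
    -storepass vmware
}
```

After running `enable_root_login`, restart sshd as in step 3. `list_trusted_cert hq` and `list_trusted_cert dr` should each print an entry if the two imports succeeded. Once the certificates are copied, it is sensible to set PermitRootLogin back to its previous value from the .bak file.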

After a few minutes you should see

Check the install certificate box and click ignore. After a few minutes you should see the changes below in the console

This indicates the VRMS is partnered up with the vCenter. The build number and the status of "VRMS Servers are not paired" mean they are talking; before, it said disconnected.

I am now stuck at the point of trying to actually pair the VRMS servers. When I select "Configure VRMS Connection" I get the popup message below;

Below is a log excerpt from the SRM logs on the vCenter

2012-03-30T16:48:43.831-04:00 [11884 verbose 'HbrProvider'] Dr::Replication::HbrProviderImpl::SetRemoteInfoFailed: Unable to get remote HMS server info, error=
--> (dr.hbrProvider.fault.HmsServersNotPaired) {
-->    dynamicType = <unset>, 
-->    faultCause = (vmodl.MethodFault) null, 
-->    localHmsUuid = "d2f5461c-59e4-43fa-b6a0-8fa5f198e9a3", 
-->    localHmsName = "HQ-VRMS", 
-->    remoteHmsUuid = "5521a563-205b-4a77-adff-920ead24d7ad", 
-->    remoteHmsName = "DR-VRMS", 
-->    msg = "", 
--> }
2012-03-30T16:48:46.174-04:00 [08380 verbose 'SessionManager'] Logging out remote site 'site-1024' for session '523a6'
2012-03-30T16:48:46.174-04:00 [08380 verbose 'RemoteSite'] Logging out remote site 'Site Recovery for dr-vc.ftsi.lab', session '523a6'
2012-03-30T16:48:46.174-04:00 [08380 info 'RemoteSite'] Logged out from remote site 'Site Recovery for dr-vc.ftsi.lab', session '523a6'
2012-03-30T16:48:46.174-04:00 [08380 info 'SessionManager'] Remote site 'site-1024' successfully logged out for session '523a6'.
2012-03-30T16:48:46.174-04:00 [08876 verbose 'SessionManager'] Removing session ID for session '52889', remote site 'site-1024'
2012-03-30T16:48:46.174-04:00 [08876 verbose 'RemoteSite'] Removing session ID for remote site 'Site Recovery for dr-vc.ftsi.lab', session '52889'
2012-03-30T16:48:46.174-04:00 [08876 verbose 'RemoteVC'] Shutting down connection
2012-03-30T16:48:46.174-04:00 [08876 verbose 'RemoteVC'] [PCM] Stopping...
2012-03-30T16:48:46.174-04:00 [08876 warning 'VixVcDomain'] VIX connection already logged out
2012-03-30T16:48:46.174-04:00 [08876 info 'RoleRegistry'] Shutting down...
2012-03-30T16:48:46.174-04:00 [08876 error 'RemoteVC'] [PM] Cannot unregister callback for filter token '1' because PropertyMonitor is stopped
2012-03-30T16:48:46.174-04:00 [08876 verbose 'RemoteDR'] Shutting down connection
2012-03-30T16:48:46.174-04:00 [08876 verbose 'RemoteDR'] [PCM] Stopping...
2012-03-30T16:48:46.190-04:00 [08876 info 'RemoteSite'] Removed session ID for remote site 'Site Recovery for dr-vc.ftsi.lab', session '52889'
2012-03-30T16:48:46.190-04:00 [08876 info 'SessionManager'] Remove session ID for session '52889', remote site 'site-1024' is successful
2012-03-30T16:48:46.190-04:00 [09152 info 'vmomi.soapStub[17]'] Resetting stub adapter for server TCP:dr-vc.ftsi.lab:80 : Closed
2012-03-30T16:48:46.206-04:00 [10140 verbose 'SessionManager'] Logging out user 'FTSI\administrator', session '52889'
2012-03-30T16:48:46.206-04:00 [10140 verbose 'Default'] CloseSession called for session id=52889fa2-40dd-b53d-b8b3-0556a7871eb4
2012-03-30T16:48:46.206-04:00 [10140 verbose 'SessionManager'] Closing session '52889'
2012-03-30T16:48:46.206-04:00 [10140 verbose 'LocalVC'] Shutting down connection
2012-03-30T16:48:46.206-04:00 [10140 verbose 'LocalVC'] [PCM] Stopping...
2012-03-30T16:48:46.206-04:00 [10140 warning 'VixVcDomain'] VIX connection already logged out
2012-03-30T16:48:46.206-04:00 [10140 info 'RoleRegistry'] Shutting down...
2012-03-30T16:48:46.206-04:00 [10140 error 'LocalVC'] [PM] Cannot unregister callback for filter token '1' because PropertyMonitor is stopped
2012-03-30T16:48:46.206-04:00 [10140 info 'SessionManager'] Closed session '52889'
2012-03-30T16:48:46.206-04:00 [06880 info 'vmomi.soapStub[19]'] Resetting stub adapter for server TCP:hq-vc.ftsi.lab:80 : Closed
2012-03-30T16:48:48.753-04:00 [05572 verbose 'SessionManager'] Logging out user 'FTSI\administrator', session '523a6'
2012-03-30T16:48:48.753-04:00 [05572 verbose 'Default'] CloseSession called for session id=523a6f71-17e6-c0bb-820c-37ea5e081477
2012-03-30T16:48:48.753-04:00 [05572 verbose 'SessionManager'] Closing session '523a6'
2012-03-30T16:48:48.753-04:00 [05572 verbose 'LocalVC'] Shutting down connection
2012-03-30T16:48:48.753-04:00 [05572 verbose 'LocalVC'] [PCM] Stopping...
2012-03-30T16:48:48.753-04:00 [05572 warning 'VixVcDomain'] VIX connection already logged out
2012-03-30T16:48:48.753-04:00 [05572 info 'RoleRegistry'] Shutting down...
2012-03-30T16:48:48.753-04:00 [05572 error 'LocalVC'] [PM] Cannot unregister callback for filter token '1' because PropertyMonitor is stopped
2012-03-30T16:48:48.753-04:00 [05572 info 'SessionManager'] Closed session '523a6'
2012-03-30T16:48:48.753-04:00 [03668 info 'vmomi.soapStub[1]'] Resetting stub adapter for server TCP:hq-vc.ftsi.lab:80 : Closed
2012-03-30T16:48:48.862-04:00 [05320 verbose 'HbrProvider'] Dr::Replication::HbrProviderImpl::SetRemoteInfoFailed: Unable to get remote HMS server info, error=
--> (dr.hbrProvider.fault.HmsServersNotPaired) {
-->    dynamicType = <unset>, 
-->    faultCause = (vmodl.MethodFault) null, 
-->    localHmsUuid = "d2f5461c-59e4-43fa-b6a0-8fa5f198e9a3", 
-->    localHmsName = "HQ-VRMS", 
-->    remoteHmsUuid = "5521a563-205b-4a77-adff-920ead24d7ad", 
-->    remoteHmsName = "DR-VRMS", 
-->    msg = "", 
--> }
2012-03-30T16:48:53.878-04:00 [15636 verbose 'HbrProvider'] Dr::Replication::HbrProviderImpl::SetRemoteInfoFailed: Unable to get remote HMS server info, error=
--> (dr.hbrProvider.fault.HmsServersNotPaired) {
-->    dynamicType = <unset>, 
-->    faultCause = (vmodl.MethodFault) null, 
-->    localHmsUuid = "d2f5461c-59e4-43fa-b6a0-8fa5f198e9a3", 
-->    localHmsName = "HQ-VRMS", 
-->    remoteHmsUuid = "5521a563-205b-4a77-adff-920ead24d7ad", 
-->    remoteHmsName = "DR-VRMS", 

VMware has not confirmed with me yet that this is in fact a bug. I will update the post when I have a resolution.
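To see how often the pairing fault recurs in a log like the excerpt above, the fault name can be counted and the HMS names pulled out. This is a sketch that assumes the SRM log has been copied to a machine with a POSIX shell and a grep that supports -o; the function names are hypothetical, and the strings matched are taken from the excerpt.

```shell
# Count how many times the pairing fault appears in an SRM log file.
count_pairing_faults() {
  grep -cF 'dr.hbrProvider.fault.HmsServersNotPaired' "$1"
}

# List the distinct local/remote HMS names reported alongside the fault.
list_hms_names() {
  grep -oE '(local|remote)HmsName = "[^"]*"' "$1" | sort -u
}
```

A count that keeps climbing every few seconds, as in the excerpt, suggests SRM is retrying the pairing lookup rather than failing once.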

UPDATE With Resolution

I don't know if you would necessarily call this a bug or just something of an installation nuance, but here it is.

So VMware Engineering had the opportunity to review this issue, and it ultimately turned out to be related to the vCenter certificate. In addition to the MD5 hash issue described above, you also need to make sure your certificate (trusted CA-signed or self-signed) is generated with a 2048-bit RSA key. Earlier installations of vCenter generated MD5 certificates with 1024-bit RSA keys. If you encounter this issue the remediation is simple. Uninstall SRM altogether, because SRM imports the vCenter certificate and uses it for site pairing and other tasks. Be sure to unregister your VRMS server before uninstalling SRM Server. The next step is to generate new certificates. I used openssl; here are the commands:

Generate new key and CSR (CSR is needed if requesting a trusted CA Cert):

openssl req -out CSR.csr -new -newkey rsa:2048 -nodes -keyout rui.key

Generate the certificate files:

openssl req -new -x509 -days 3650 -sha1 -nodes -key rui.key -out rui.crt -subj "/C=US/ST=NH/L=Seabrook/CN=hq-vc.ftsi.lab"

openssl pkcs12 -export -in rui.crt -inkey rui.key -name rui -passout pass:testpassword -out rui.pfx

Basically this will generate 4 files (CSR.csr, rui.key, rui.crt and rui.pfx) that you'll want to copy to C:\ProgramData\VMware\VMware VirtualCenter\SSL (default location)

Make sure that you save your old certificates, just in case. Also, when it asks for a key password, the default is "testpassword" without the quotes
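Since the key size turned out to matter, a quick check of the new certificate before reinstalling SRM can save a round trip. The sketch below assumes a Unix-like shell with openssl; the function names are hypothetical, and the parsing relies on the "Public-Key: (N bit)" line in `openssl x509 -noout -text` output (older builds print "RSA Public Key: (N bit)", which the pattern also accepts).

```shell
# Read `openssl x509 -noout -text` output on stdin and print the key size in bits.
key_bits_from_text() {
  sed -n 's/.*Public[- ]Key: (\([0-9][0-9]*\) bit).*/\1/p' | head -n 1
}

# Print the key size of a PEM certificate file.
cert_key_bits() {
  openssl x509 -in "$1" -noout -text | key_bits_from_text
}

# Succeed only when the given bit count is at least 2048.
key_bits_ok() {
  [ -n "$1" ] && [ "$1" -ge 2048 ]
}
```

For example, `key_bits_ok "$(cert_key_bits rui.crt)" && echo "key size OK"` prints the message only for a certificate with a key of 2048 bits or more.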

After you have replaced the vCenter certificates, go through the SRM install as we did above and you'll be all set to go. No need to log in to the VRMSs and add certificates to the keystores or anything; it just works. Do the tasks in this order:

1. Setup order is VERY specific.

a. Deploy Site “A” VRMS server from VI Client to Site “A”.

b. Open web browser to IP of VRMS server “A”.

c. Generate new SSL Certificate and install.

d. Add settings for VC / DB / etc, BY IP address ONLY.

e. Hit “Save and Restart Service” when setup.

f. On Site “A” VI Client, open “vCenter Solutions Manager”

   i. Click VR Management. You should be soon prompted to accept a certificate. Do so, and click the ignore button.

g. Open SRM. Click vSphere Replication for the appropriate site.

   i. Wait for a certificate prompt. Accept it.

   ii. In about a minute, you should see the VRMS server log in to SRM.

h. Close VI Client for site “A”, open VI Client to site “B”, and repeat prior steps on Site “B”.

i. Connect VRMS servers together.

j. Deploy VR servers on required systems.

k. Add VR servers.

Tuesday, March 27, 2012

Protect yourself against VMware HA Outages

Something that I see out in the field a lot, or at least more than I should, is clusters left with all the defaults. This is a common cause of unexpected outages. HA was developed to help automate the recovery of VMs when they become unavailable. The thing we have to define is what "unavailable" means. Technically speaking, if a host is isolated on the network, its VMs are unavailable, so host isolation detection is one feature VMware has built into HA. I'm not going to dig too deep on this because I wouldn't do nearly as good a job as Duncan Epping does on his blog in his HA Deepdive section;

One of the most underutilized options in HA is the ability to control what is used to define host isolation. For this first part let's assume that we're talking about vSphere 4.x and earlier. By default, host isolation is determined by the host's ability to simply ping the default gateway from a management interface. The image below shows the default settings you get if you just check the HA box on the cluster;

You'll notice that the default isolation response at the top of the page is set to "shutdown" the VMs on the host. This is the most common cause of an accidental outage because the default gateway is commonly not controlled by the guys who manage the VMware environment. Reference the image below;

Imagine that the network team asked you to use the core switch as the default gateway on your ESXi hosts. Well, if the network admin reboots the core switch for a code upgrade or some sort of maintenance, guess what happens. The ESXi hosts all think they are isolated and start shutting down VMs. So something that was a four or five minute outage while the switch rebooted and initialized the new code now turns into an hour or more of trying to make sure all the VMs come back up and are powered on in the correct order, etc. This is obviously something we want to avoid.

How do we fix this? The easiest way is to identify a secondary and/or tertiary device to use as a host isolation address. What device should we use? Well, in the example above we have a few options. I would probably use the management interface on the top of rack switch, the firewall address, or try to find something in the same rack (like a pair of load balancers or something). This will ensure that if the rack becomes isolated due to an upstream link failure, your VMs don't just shut down. If the whole rack is isolated there's no point in shutting down the VMs. Having these three devices would provide the best level of protection we can get. If you plan to reboot the top of rack switch there's no getting around an isolation response, and you should disable HA when doing such maintenance.

Implementing HA advanced options is quite easy, and that is how we would remedy this issue. We would use a couple of HA options in this scenario:

1. das.usedefaultisolationaddress - we'll set this to false so that HA stops using the default gateway automatically and only uses the isolation addresses we define (the default gateway can still be listed as one of them if we do decide to use it)

2. das.isolationaddress<X> - Where <X> is the numbered entry of the isolation address, so if we had three isolation addresses we'd use das.isolationaddress1, das.isolationaddress2, das.isolationaddress3

To set these values go to the VMware Virtual Infrastructure Client, right-click your HA cluster, choose Edit Settings, select the VMware HA section, and click the Advanced Options button on the bottom right of the window. We will then fill it out as depicted in the image below;

This will set your cluster to use these addresses to determine host isolation. Keep in mind that if any of these devices' IPs change, your cluster needs to be updated. Again, this applies mostly to vSphere 4.x and earlier; there are a few more protection mechanisms in vSphere 5 that I will cover in a later post.
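As a text stand-in for the settings, the option/value pairs for a given set of devices can be generated as below. This is only a sketch: the helper name and the IP addresses in the usage note are hypothetical, and the output is meant to be typed into the Advanced Options dialog, not applied automatically.

```shell
# Print the HA advanced options for a list of isolation addresses.
# Arguments: one IP address per isolation device, in priority order.
ha_isolation_options() {
  echo "das.usedefaultisolationaddress = false"
  i=1
  for addr in "$@"; do
    echo "das.isolationaddress${i} = ${addr}"
    i=$((i + 1))
  done
}
```

For example, `ha_isolation_options 192.0.2.10 192.0.2.20 192.0.2.30` (placeholder addresses for the top of rack switch, firewall and load balancer) prints four lines: the usedefaultisolationaddress entry followed by das.isolationaddress1 through das.isolationaddress3.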


VMware VIEW PCoIP optimization

I recently attended VMware Partner Exchange, which is an event that's geared towards partners, obviously. We get a lot of great information out of the event and get to interact with a lot of the guys that work on the development teams of a variety of VMware products. I had the pleasure of attending a discussion with the VMware End User Computing (EUC) team. This was one of the more informative sessions I attended while out there. One of the Senior Consultants from that group, Chuck Hirstius, did a particularly great job of presenting a bunch of information on PCoIP. If you don't already follow Chuck's blog, it can be found at the link below;

Chuck has developed a few tools that are worth checking out for troubleshooting and managing PCoIP. They are not officially endorsed by VMware, but from my experience with the tools they work great.

First up is the PCoIP Log Viewer -

Basically this is a Java based tool that you can download to your PC or a Server in the environment.

Note that when you use this tool to monitor a PCoIP session you are pointing the tool at the machine running the PCoIP Server process, which is typically the device you are connecting to, not the thin client or end user device. 

The tool will ultimately allow you to watch a PCoIP session and see what is consuming bandwidth on that session, how much latency the session sees, etc. This gives you clues on what might need to be changed, if anything, to optimize PCoIP. Now, as Chuck calls out in his blog, most people's first reaction is to say "well, isn't PCoIP adaptive?" The answer is yes; however, by changing some of the defaults we can help the protocol adapt more fluidly. From a consultant's perspective this tool has come in handy in trying to fit a bunch of PCoIP sessions down a small pipe in the most efficient way possible.

So to be completely honest here, I haven't used the log parser yet, which Chuck lists in the post above. The concept of the log parser is to take PCoIP logs and collect them so they can be read by the log viewer. The live session capabilities are fantastic, and I can't wait to dig into the parser, because I'm sure it'll be excellent for longer term troubleshooting, or for identifying the root cause of something that's already happened.

Check out Chuck's video regarding usage of the tool;

Just to give a real world example, by default audio in a PCoIP session is allowed to spike to 500kbps. To give a frame of reference, a phone call over IP is typically 64k - 80k. Now obviously we may not want mono voice quality, but audio certainly doesn't need 500k in most use cases. In a project I recently worked on we had deployed PCoIP over a WAN link that was relatively small. We had the log viewer up and running, watching a test user session with PCoIP defaults selected. We noticed that when a user erroneously picked something in an application, the Windows error sound played and bandwidth promptly spiked to 350k to shove that audio down the link as soon as possible. One of Chuck's first recommendations, which I agree with, is to drop the Max Audio Settings to 100k.
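To put the audio cap in perspective, a rough worst case is simply the number of sessions multiplied by the per-session limit. The sketch below is plain arithmetic with a hypothetical helper name; it assumes every session peaks at the same moment, which overstates real usage, and the session count in the usage note is made up for illustration.

```shell
# Worst-case audio bandwidth in kbps if every session peaks at once.
# Arguments: number of sessions, per-session audio limit in kbps.
peak_audio_kbps() {
  sessions="$1"
  per_session_kbps="$2"
  echo $(( sessions * per_session_kbps ))
}
```

With 20 sessions, `peak_audio_kbps 20 500` gives 10000 kbps at the default limit, while `peak_audio_kbps 20 100` gives 2000 kbps with the limit dropped to 100k.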

Enter Chuck's second tool, the PCoIP configuration tool (beta) -

This tool allows you to change PCoIP settings to a particular device using a graphical interface. This is incredibly handy for testing out new settings, prior to deployment. It also allows you to tweak them before rolling them out to end users very easily. Chuck has even included some defaults, so you can use his "WAN" profile as a starting point for testing.

Note: You do have to disconnect and reconnect for these settings to take effect, however a reboot of the VM is not required.

After reading the information above you're probably wondering what settings to change. There is definitely an art to getting this right. There is no chart that says if you see this happen, change this setting. A lot of what we deal with in VIEW, or any virtual desktop technology, is related to end user experience. There are things we can change to work around bandwidth or latency constraints, but they do have an effect on the end user experience. Our goal as Engineers/Consultants is to make the setting changes have as minimal an impact as possible. Below is a link to the VMware KB that describes what each PCoIP setting is and what it does;

VMware KB for PCoIP

Lastly we need to know how to change these settings. From an administrative perspective it is possible to set these in the registry or with Chuck's tool; however, that is not the recommended approach. This is something that can be done through Active Directory GPOs. If you don't know this already, there are ADM templates available that can be imported into your AD Group Policy Management console for use in deploying PCoIP settings through a GPO. They are part of the default installation of a VIEW Connection Server and can be found in C:\Program Files\VMware\VMware View\Server\extras\GroupPolicyFiles. You'll notice other ADM templates here, and all are useful; the one we want is "pcoip.adm". After you've imported the ADM template you can create a GPO and change the settings for any of the objects listed in the VMware KB article above. This will allow you to change the PCoIP settings per desktop pool if you'd like, as long as you put each desktop pool in its own AD OU.

As discussed above there are a great many settings we can change in PCoIP. The challenge is to figure out which ones are appropriate to change. I believe that the tools discussed above will help us to make informed decisions regarding modification of these settings, and hopefully provide our users with a better end user experience.


Monday, March 26, 2012

VMware VIEW - Optimizing Windows 7

A question I often get when doing a VMware VIEW implementation is "What should I do to my Windows 7 image to optimize it for use over the LAN/WAN?". My response is usually to follow the Windows 7 optimization guide for VMware VIEW. Before we get into it too much, below is a link to the document;

I must admit the first time I used this document I skipped right over the "About this Guide" section to try and get right into the meat of things. Please don't make the same mistake I did. Take a look at that section; there is invaluable information in it. They lay out a process for installation of the VIEW agents and implementation of the scripts.

When you open the link above you may be wondering where to find the script files. Many of us use an alternative browser like Google Chrome, which has a built in PDF viewer of sorts. The downside of this is when we open a guide like we're discussing here we may not see the attachments on the PDF. For this reason be sure to download the guide and open it with Adobe Reader. You should see something similar to below with an attachment window on the side bar.

You'll notice there are three scripts. The difference between them is related to VMware VIEW profile management. Below is their function and how it relates to profile management;

1. CommandsDesktopsReadyForPersonaManagement.txt - Use this script for any parent image you have previously run the "CommandsNoPersonaManagement.txt" script on, but would now like to use Persona Management on.

2. CommandsNoPersonaManagement.txt - Use this script for any parent image you would like to use WITHOUT VIEW Persona Management

3. CommandsPersonaManagement.txt - Use this script for any parent image you would like to use with VIEW Persona Management

The big difference between these scripts is really two services that are either left on for using persona management, or turned off if you are not planning on using persona management.

Basically what you want to do here is save the attachments on the PDF to hard disk and rename them to *.bat. This will turn them into valid batch files for execution on your parent machines. One thing I would highly recommend is to run them in a command window that you open as administrator. If you just double click them, the commands will run and then the command window will close. If you open a command window, browse to the directory where the batch files are located, and run the appropriate one, the script will execute but the window will remain open, and you can see which commands were successful and which were not. This allows for better troubleshooting.