Friday, March 30, 2012

VMware SRM vSphere replication VRMS issues

I have started to do some testing with the new vSphere replication technology that is now part of VMware Site Recovery Manager. I began working on this in our lab, and encountered what I believe is a bug that many people will run into in their production environments.

In the new version of SRM there is a new feature called vSphere replication, which gives the user the ability to replicate VMs on a per-VM basis. This is very beneficial if you do not have similar storage arrays or array-based replication software at either end. It can also come in handy if you do not want to do per-volume replication and fail over all VMs on a volume.

vSphere replication is done using a vSphere Replication Management Server (VRMS) at each site, which communicates with vCenter. At each site you will also need several vSphere Replication Servers (VRs) to facilitate the actual replication. The problems I have encountered occur in the communication between the VRMS and the vCenter.

Problem number one - The first issue I encountered was that the certificate generated by my vCenter installation a few years ago had expired (by default the installer certificate is only valid for two years). I looked up in VMware's KB how to attack this problem and came across the following article. I performed the tasks in the article, generated a nice new self-signed certificate, and all was good.

http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=1009092

When I began deployment of my VRMS, the OVF file uploaded perfectly and the OS started just fine. I logged into the VRMS, configured a database connection, and told it where my vCenter was. When I selected "Save and Restart Service" it told me there was something wrong with the certificate, which is pretty normal for a self-signed certificate, so I accepted it. I went back to the vSphere Client, clicked the SRM button, and selected the vSphere replication section. Then I clicked "Configure VRMS Connection" on the right side, which popped up this message:


I did not read the certificate error message closely enough. After things didn't work I had to start investigating the logs to find out what was wrong. Below is a link to the VMware KB article describing the issue:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2013087

****************************************************************************
Disclaimer: The following procedure is likely not supported by VMware and is a documented process of my own experience to fix an issue that I have personally encountered. Please use this information at your own risk and be careful doing so in production environments.
****************************************************************************

The documentation on how these VRMSs work is a little lacking; I'm sure it's because the technology is brand new. I had to log in to the Linux shell of the appliance and do some poking around. Open a console and log in as root with the <password> that you specified during the OVF deployment. To save you some work, run this command after logging into the appliance:

less /opt/vmware/logs/hms/hms.log

Hit the End key and it will take you to the end of the log, from where you can use the up arrow to scroll back through it.
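If paging through the whole log is tedious, a small filter can narrow it down. This is only a sketch: the function name hms_cert_errors and the search keywords are my own guesses at what is useful here, not anything taken from VMware's documentation, so adjust them to whatever your log actually contains.

```shell
# Print the most recent certificate-related lines from an HMS log.
# Usage: hms_cert_errors [logfile] [max_lines]
# The keyword list is an assumption; extend it as needed.
hms_cert_errors() {
    log="${1:-/opt/vmware/logs/hms/hms.log}"
    max="${2:-20}"
    # -i: ignore case, -n: prefix each match with its line number,
    # -E: extended regular expression
    grep -inE 'md5|certificate|ssl|keystore' "$log" | tail -n "$max"
}
```

Running hms_cert_errors with no arguments reads the log path used above and prints at most the last 20 matching lines.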

We noticed a variety of "MD5 hash" errors in the logs. If you go through the SRM release notes you will find that SRM does not support a vCenter certificate signed with the RSA MD5 hash algorithm; it has to be RSA SHA-1. In the KB article I linked above on generating a new self-signed certificate, guess which hashing algorithm the commands they give you use for your new certificate ... yup ... MD5. Below is the excerpt:

openssl.exe req -new -x509 -days 3650 -md5 -nodes -key rui.key -out rui.crt -subj "fqdn_of_VC" 

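What you need to do is go back and generate a new SSL certificate using -sha1 in place of -md5. Note that openssl expects the -subj value to start with a slash, in the form "/CN=fqdn_of_VC":

openssl.exe req -new -x509 -days 3650 -sha1 -nodes -key rui.key -out rui.crt -subj "/CN=fqdn_of_VC"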
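Before re-registering anything it is worth confirming which hash the certificate was really signed with. A minimal sketch, assuming a PEM-encoded certificate and an openssl binary on the path; the function name cert_sig_alg is hypothetical.

```shell
# Print the signature algorithm recorded in a PEM certificate.
# Usage: cert_sig_alg rui.crt
cert_sig_alg() {
    # The text dump lists the algorithm twice; keep the first occurrence
    # and strip the leading label so only the algorithm name remains.
    openssl x509 -in "$1" -noout -text \
        | sed -n 's/^ *Signature Algorithm: //p' \
        | head -n 1
}
```

A certificate created with -md5 should report md5WithRSAEncryption, and one created with -sha1 should report sha1WithRSAEncryption.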

Once this has been completed, log in to the web interface of the VRMS, select the config tab, and unregister from vCenter. Then register with vCenter again. This will pop up the new certificate; select accept.

That takes care of one problem. On to problem two.

I went back into the SRM console thinking I had resolved the issue and tried the "Configure VRMS Connection" link again, only to get the same error message. I went back into the same log and discovered the different error messages below.


So if I interpreted these errors correctly, the VRMS is having trouble validating the certificate chain of the vCenter, but why? I don't have the box checked that requires all certificates to be trusted by a CA.


We tried a bunch of different things here, from generating new certificates using new keys to changing types of certificates, but nothing worked. We took a break and I started digging around to see if I could find where the certificates were being checked. These looked like Java error messages to me, so we started looking for Java KeyStore (JKS) files. It just so happens that if you look in the following path

/opt/vmware/hms/security

do an ls and you should see some files

hms-keystore.jks
hms-truststore.jks

Apparently hms-truststore.jks is where the certificates that the VRMS appliance trusts are stored. Basically what you need to do at this point is add the vCenter certificate to this keystore.

1. Download and install an SCP client like WinSCP. This will allow you to copy the rui.crt file off the vCenter to the appliance.

2. Modify the appliance to allow remote root login. Open /etc/ssh/sshd_config in vi and edit the PermitRootLogin line to say yes.


3. Restart the SSH service by running this command

service sshd restart

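4. Connect to the appliance by IP with the WinSCP client and upload the rui.crt file from each of your vCenter servers (production and DR) into the /opt/vmware/hms/security folder. On the vCenter the file can be found in its default location of "C:\ProgramData\VMware\VMware VirtualCenter\SSL". Give the two copies different names so they do not overwrite each other; the commands below assume hq-rui.crt and dr-rui.crt.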

5. Import the certificate into the hms-truststore.jks using the below commands

cd /opt/vmware/hms/security
keytool -import -trustcacerts -alias hq -file hq-rui.crt -keystore hms-truststore.jks
keytool -import -trustcacerts -alias dr -file dr-rui.crt -keystore hms-truststore.jks

It will ask for the keystore password, which is "vmware" (no quotes).

Then it will ask if you're sure you want to trust the certificate; type "yes" (no quotes). You should see something like this



6. Log back into the web interface of your VRMS and select the config tab and unregister, then reregister your VRMS with vcenter.
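When keytool asks whether to trust a certificate in step 5, it prints the certificate's fingerprints. One way to be sure you are importing the file you think you are is to compute the same fingerprint independently and compare the two by eye. This is a sketch under the assumption that openssl is available wherever you run it; cert_fingerprint is a name I made up.

```shell
# Print the SHA-1 fingerprint of a PEM certificate as colon-separated hex.
# Usage: cert_fingerprint hq-rui.crt
cert_fingerprint() {
    # openssl prints "<label> Fingerprint=AA:BB:..."; keep only the hex part.
    openssl x509 -in "$1" -noout -fingerprint -sha1 | sed 's/^.*=//'
}
```

For example, cert_fingerprint hq-rui.crt and cert_fingerprint dr-rui.crt should each match the SHA1 value keytool shows for the corresponding import.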

After a few minutes you should see


Check the install certificate box and click ignore. After a few minutes you should see the below changes in the console



This indicates the VRMS is partnered up with the vCenter. The build number and the status "VRMS Servers are not paired" mean they are talking; before, it said disconnected.

I am now stuck at the point of actually pairing the VRMS servers. When I select "Configure VRMS Connection" I get the below popup message:


Below is a log excerpt from the SRM logs on the vCenter:


2012-03-30T16:48:43.831-04:00 [11884 verbose 'HbrProvider'] Dr::Replication::HbrProviderImpl::SetRemoteInfoFailed: Unable to get remote HMS server info, error=
--> (dr.hbrProvider.fault.HmsServersNotPaired) {
-->    dynamicType = <unset>, 
-->    faultCause = (vmodl.MethodFault) null, 
-->    localHmsUuid = "d2f5461c-59e4-43fa-b6a0-8fa5f198e9a3", 
-->    localHmsName = "HQ-VRMS", 
-->    remoteHmsUuid = "5521a563-205b-4a77-adff-920ead24d7ad", 
-->    remoteHmsName = "DR-VRMS", 
-->    msg = "", 
--> }
2012-03-30T16:48:46.174-04:00 [08380 verbose 'SessionManager'] Logging out remote site 'site-1024' for session '523a6'
2012-03-30T16:48:46.174-04:00 [08380 verbose 'RemoteSite'] Logging out remote site 'Site Recovery for dr-vc.ftsi.lab', session '523a6'
2012-03-30T16:48:46.174-04:00 [08380 info 'RemoteSite'] Logged out from remote site 'Site Recovery for dr-vc.ftsi.lab', session '523a6'
2012-03-30T16:48:46.174-04:00 [08380 info 'SessionManager'] Remote site 'site-1024' successfully logged out for session '523a6'.
2012-03-30T16:48:46.174-04:00 [08876 verbose 'SessionManager'] Removing session ID for session '52889', remote site 'site-1024'
2012-03-30T16:48:46.174-04:00 [08876 verbose 'RemoteSite'] Removing session ID for remote site 'Site Recovery for dr-vc.ftsi.lab', session '52889'
2012-03-30T16:48:46.174-04:00 [08876 verbose 'RemoteVC'] Shutting down connection
2012-03-30T16:48:46.174-04:00 [08876 verbose 'RemoteVC'] [PCM] Stopping...
2012-03-30T16:48:46.174-04:00 [08876 warning 'VixVcDomain'] VIX connection already logged out
2012-03-30T16:48:46.174-04:00 [08876 info 'RoleRegistry'] Shutting down...
2012-03-30T16:48:46.174-04:00 [08876 error 'RemoteVC'] [PM] Cannot unregister callback for filter token '1' because PropertyMonitor is stopped
2012-03-30T16:48:46.174-04:00 [08876 verbose 'RemoteDR'] Shutting down connection
2012-03-30T16:48:46.174-04:00 [08876 verbose 'RemoteDR'] [PCM] Stopping...
2012-03-30T16:48:46.190-04:00 [08876 info 'RemoteSite'] Removed session ID for remote site 'Site Recovery for dr-vc.ftsi.lab', session '52889'
2012-03-30T16:48:46.190-04:00 [08876 info 'SessionManager'] Remove session ID for session '52889', remote site 'site-1024' is successful
2012-03-30T16:48:46.190-04:00 [09152 info 'vmomi.soapStub[17]'] Resetting stub adapter for server TCP:dr-vc.ftsi.lab:80 : Closed
2012-03-30T16:48:46.206-04:00 [10140 verbose 'SessionManager'] Logging out user 'FTSI\administrator', session '52889'
2012-03-30T16:48:46.206-04:00 [10140 verbose 'Default'] CloseSession called for session id=52889fa2-40dd-b53d-b8b3-0556a7871eb4
2012-03-30T16:48:46.206-04:00 [10140 verbose 'SessionManager'] Closing session '52889'
2012-03-30T16:48:46.206-04:00 [10140 verbose 'LocalVC'] Shutting down connection
2012-03-30T16:48:46.206-04:00 [10140 verbose 'LocalVC'] [PCM] Stopping...
2012-03-30T16:48:46.206-04:00 [10140 warning 'VixVcDomain'] VIX connection already logged out
2012-03-30T16:48:46.206-04:00 [10140 info 'RoleRegistry'] Shutting down...
2012-03-30T16:48:46.206-04:00 [10140 error 'LocalVC'] [PM] Cannot unregister callback for filter token '1' because PropertyMonitor is stopped
2012-03-30T16:48:46.206-04:00 [10140 info 'SessionManager'] Closed session '52889'
2012-03-30T16:48:46.206-04:00 [06880 info 'vmomi.soapStub[19]'] Resetting stub adapter for server TCP:hq-vc.ftsi.lab:80 : Closed
2012-03-30T16:48:48.753-04:00 [05572 verbose 'SessionManager'] Logging out user 'FTSI\administrator', session '523a6'
2012-03-30T16:48:48.753-04:00 [05572 verbose 'Default'] CloseSession called for session id=523a6f71-17e6-c0bb-820c-37ea5e081477
2012-03-30T16:48:48.753-04:00 [05572 verbose 'SessionManager'] Closing session '523a6'
2012-03-30T16:48:48.753-04:00 [05572 verbose 'LocalVC'] Shutting down connection
2012-03-30T16:48:48.753-04:00 [05572 verbose 'LocalVC'] [PCM] Stopping...
2012-03-30T16:48:48.753-04:00 [05572 warning 'VixVcDomain'] VIX connection already logged out
2012-03-30T16:48:48.753-04:00 [05572 info 'RoleRegistry'] Shutting down...
2012-03-30T16:48:48.753-04:00 [05572 error 'LocalVC'] [PM] Cannot unregister callback for filter token '1' because PropertyMonitor is stopped
2012-03-30T16:48:48.753-04:00 [05572 info 'SessionManager'] Closed session '523a6'
2012-03-30T16:48:48.753-04:00 [03668 info 'vmomi.soapStub[1]'] Resetting stub adapter for server TCP:hq-vc.ftsi.lab:80 : Closed
2012-03-30T16:48:48.862-04:00 [05320 verbose 'HbrProvider'] Dr::Replication::HbrProviderImpl::SetRemoteInfoFailed: Unable to get remote HMS server info, error=
--> (dr.hbrProvider.fault.HmsServersNotPaired) {
-->    dynamicType = <unset>, 
-->    faultCause = (vmodl.MethodFault) null, 
-->    localHmsUuid = "d2f5461c-59e4-43fa-b6a0-8fa5f198e9a3", 
-->    localHmsName = "HQ-VRMS", 
-->    remoteHmsUuid = "5521a563-205b-4a77-adff-920ead24d7ad", 
-->    remoteHmsName = "DR-VRMS", 
-->    msg = "", 
--> }
2012-03-30T16:48:53.878-04:00 [15636 verbose 'HbrProvider'] Dr::Replication::HbrProviderImpl::SetRemoteInfoFailed: Unable to get remote HMS server info, error=
--> (dr.hbrProvider.fault.HmsServersNotPaired) {
-->    dynamicType = <unset>, 
-->    faultCause = (vmodl.MethodFault) null, 
-->    localHmsUuid = "d2f5461c-59e4-43fa-b6a0-8fa5f198e9a3", 
-->    localHmsName = "HQ-VRMS", 
-->    remoteHmsUuid = "5521a563-205b-4a77-adff-920ead24d7ad", 
-->    remoteHmsName = "DR-VRMS", 
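The excerpt above repeats the same fault several times among a lot of session noise. A rough way to summarise it is sketched below; the function name hms_pairing_faults is hypothetical, the log file is whatever path you pass in, and the parsing simply assumes the line layout shown in the excerpt.

```shell
# Summarise HmsServersNotPaired faults in an SRM log.
# Usage: hms_pairing_faults <srm-log-file>
hms_pairing_faults() {
    awk '
        # Count every occurrence of the fault type.
        /HmsServersNotPaired/ { faults++ }
        # Lines look like:  -->    localHmsName = "HQ-VRMS",
        # so the value is field 4; strip the quotes and trailing comma.
        /localHmsName = /  { gsub(/[",]/, "", $4); local_name = $4 }
        /remoteHmsName = / { gsub(/[",]/, "", $4); remote_name = $4 }
        END { printf "faults=%d local=%s remote=%s\n", faults, local_name, remote_name }
    ' "$1"
}
```

It prints a single line with the number of faults and the last local and remote HMS names seen, which is enough to tell whether the pairing error is still recurring after a change.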



VMware has not confirmed with me yet that this is in fact a bug. I will update the post when I have a resolution.

*****************************************************************************
UPDATE With Resolution
*****************************************************************************

I don't know if you would necessarily call this a bug or just an installation nuance, but here it is.

VMware Engineering had the opportunity to review this issue and it ultimately turned out to be related to the vCenter certificate. In addition to the MD5 hash issue described above, you also need to make sure your certificate (trusted CA signed or self-signed) is generated with a 2048-bit RSA key. Earlier installations of vCenter have MD5 certificates with 1024-bit RSA keys.

If you encounter this issue the remediation is simple. Uninstall SRM altogether, because SRM imports the vCenter certificate and uses it for site pairing and other tasks. Be sure to unregister your VRMS server before uninstalling SRM Server. The next step is to generate new certificates. I used openssl; here are the commands:

Generate a new key and CSR (the CSR is needed if requesting a trusted CA certificate):

openssl req -out CSR.csr -new -newkey rsa:2048 -nodes -keyout rui.key

Generate the certificate files:

openssl req -new -x509 -days 3650 -sha1 -nodes -key rui.key -out rui.crt -subj "/C=US/ST=NH/L=Seabrook/CN=hq-vc.ftsi.lab"

openssl pkcs12 -export -in rui.crt -inkey rui.key -name rui -passout pass:testpassword -out rui.pfx


Basically this will generate 4 files (CSR.csr, rui.key, rui.crt and rui.pfx) that you'll want to copy to C:\ProgramData\VMware\VMware VirtualCenter\SSL (default location).

Make sure that you save your old certificates, just in case. Also, when it asks for a key password, the default is "testpassword" without the quotes.
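Since the whole problem came down to key size, it does no harm to check the new files before copying them over the old ones. The sketch below assumes PEM files as produced by the openssl commands above; the function names key_bits and cert_matches_key are my own.

```shell
# Print the RSA key size in bits of a private key file.
# Usage: key_bits rui.key
key_bits() {
    # The first line of the text dump looks like "Private-Key: (2048 bit ...".
    openssl rsa -in "$1" -noout -text \
        | sed -n 's/^.*Private-Key: (\([0-9]*\) bit.*$/\1/p' \
        | head -n 1
}

# Succeed only if the certificate and the private key share the same modulus,
# i.e. the certificate was really generated from that key.
# Usage: cert_matches_key rui.crt rui.key
cert_matches_key() {
    [ "$(openssl x509 -in "$1" -noout -modulus)" = "$(openssl rsa -in "$2" -noout -modulus)" ]
}
```

For the files generated above, key_bits rui.key should print 2048, and cert_matches_key rui.crt rui.key && echo ok should print ok.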

After you have replaced the vCenter certificates, go through the SRM install as we did above and you'll be all set to go. There is no need to log in to the VRMSs and add certificates to the keystores or anything; it just works. Do the tasks in this order:


1. Setup order is VERY specific.

   a. Deploy Site “A” VRMS server from VI Client to Site “A”.
   b. Open web browser to IP of VRMS server “A”.
   c. Generate new SSL Certificate and install.
   d. Add settings for VC / DB / etc, BY IP address ONLY.
   e. Hit “Save and Restart Service” when set up.
   f. On Site “A” VI Client, open “vCenter Solutions Manager”.
      i. Click VR Management. You should soon be prompted to accept a certificate. Do so, and click the ignore button.
   g. Open SRM. Click vSphere Replication for the appropriate site.
      i. Wait for a certificate prompt. Accept it.
      ii. In about a minute, you should see the VRMS server log in to SRM.
   h. Close VI Client for site “A”, open VI Client to site “B”, and repeat prior steps on Site “B”.
   i. Connect VRMS servers together.
   j. Deploy VR servers on required systems.
   k. Add VR servers.

