Real world experience with VSAN 2.1 on vSphere 6.0U1

I thought I’d share some observations about testing my first real-deal, non-simulated, non-nested, non-HOL, non-bogus production VSAN 2.1 cluster on vSphere 6.0 U1. This stuff may be old news for a VSAN specialist, but I’ve been heavily focused on NSX this year and frankly haven’t had a chance until now to really implement VSAN in production (one of the perils of specialization, I suppose).

Also, I’ve generally found VSAN to be a particular challenge for lab simulation / nested setups. Without actual hardware, it’s difficult to appreciate the performance aspect, and that’s a BIG part of VSAN.

As an architect, I don’t recommend people use *anything* for business critical applications until I have personally put the technology through its paces and really understand the operational experience firsthand. My general feeling, now that I’ve gotten my grubby paws on it, is that the new version of VSAN shipping with vSphere 6.0 U1 is ready for prime time WRT important VMs. It’s always been impressive performance-wise, but in this new release they really have nailed the stability concerns I had.

The items below represent questions I had about VSAN 2.1 going into this, the answers to which would have been great to know beforehand.  All testing was done with a loadsim running on multiple VMs per host set to simulate mid-day peak load.

  1. Migrating from VSS to VDS after building VSAN (including vCenter, the VSAN port groups and everything) – I needed to migrate from VSS to VDS *after* the VSAN was built and vCenter plus a bunch of other stuff was running on it. I tried to just generically do it all in one vCenter job – this caused VSAN to go haywire for about 2 minutes, then it settled down. VMs were frozen during this period. So don’t do that. BUT if you’re careful about the order, I believe it can be done non-disruptively.
  1. Power failures are well tolerated – I tried pretty hard to get VSAN to go into a state that would force some sort of CLI-based recovery. I couldn’t get that to happen. Most of the power failure simulations I did were totally non-disruptive (other than VMs doing HA reboots, obviously). I was not able to “corrupt” VSAN or piss it off enough to have to SSH in to fix it or anything like that.
  1. Individual disk failures less so, but still very good – Pulling a single randomly selected hot-swap disk out of each host in the cluster seemed to work fine. I was definitely able to make VSAN angry by pulling and replacing multiple disks faster than it could respond, but this was expected – the test was to see *when* it broke, not if it would at all. Impressively, even though I was doing stuff I shouldn’t have been, VSAN didn’t break outright. I had some VMs grind to a halt at one point, but overall the datastore was solid.
  1. SCP/BASH is weird with VSAN 2.0 – Maybe this is documented and I didn’t see it, but under /vmfs/volumes/<vsandatastore>, you can’t do all the normal stuff (e.g. run bash scripts to copy things around) – you’ll get a lot of “function not implemented” type errors. Maybe there are workarounds, but be cautious if, like me, you rely heavily on being able to SCP stuff directly to/from datastores to deploy new environments.
  1. You can build the VSAN datastore from esxcli – *before* vCenter exists!  I normally won’t deploy anything in prod without a dedicated management cluster, but that was not an option in this case.  It had to be one big cluster with the management VMs running alongside capacity VMs.  We were able to just ssh into the hosts and build the entire thing, then install vCenter on top of it!  Very cool!  I should probably read the instructions first next time.  I went into this thinking we’d have to do the vCenter tango for sure.
  1. Little vCenter dependency in general – From what I can tell, WRT VSAN, vCenter is just a convenience for doing the policies and healthcheck stuff. One of the tests I did was to pull the power on the host that was the VSAN master and was also running vCenter. Worked fine: vCenter did an HA reboot and nothing else seemed to be impacted. The VSAN datastore isn’t tied to HA/DRS cluster membership, or VDS membership, or even vCenter membership. It’s on a totally different layer. Which is great, because it makes building/upgrading/troubleshooting the environment much simpler than if it had a straight dependency.
  1. NSX and VSAN seem to coexist well – I didn’t do a ton of testing on this aspect, but from what I can tell the two things run fine in the same environment.  Didn’t run into any roadblocks on this side of things.  A few builds I did last year when VSAN still had a hard and fast L2 adjacency requirement created some obnoxious architectural constraints that are no longer an issue.   FYI you cannot run the VSAN multicast traffic over VXLAN, or at least I couldn’t get it to work even though multicast in general worked fine within the same VXLAN.  I know this is unsupported, and could create some gnarly circular dependencies, but hey why not give it a shot?
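
For anyone curious what the esxcli-only build mentioned above looks like, here is a rough sketch rather than the exact procedure we ran. It is a small Python dry-run planner that only prints the commands you would type over SSH on each host; it does not touch any host. The host names, the vmk1 interface, the device IDs and the cluster UUID placeholder are all hypothetical. The esxcli subcommands it emits (vsan network ipv4 add, vsan cluster new/get/join, vsan storage add) are the standard ones as far as I know, but check them against your build before relying on this.

```python
from typing import Dict, List, NamedTuple


class Host(NamedTuple):
    name: str                  # hypothetical host label, used only as a dict key
    vmk: str                   # VMkernel interface that will carry VSAN traffic
    cache_disk: str            # SSD device ID (placeholder value)
    capacity_disks: List[str]  # capacity device IDs (placeholder values)


def bootstrap_plan(hosts: List[Host],
                   cluster_uuid: str = "<sub-cluster-uuid>") -> Dict[str, List[str]]:
    """Return, per host, the esxcli commands to run in order.

    The first host creates the VSAN cluster; every other host joins it
    using the UUID reported by `esxcli vsan cluster get` on the first host.
    """
    if not hosts:
        raise ValueError("need at least one host")

    plan: Dict[str, List[str]] = {}
    for index, host in enumerate(hosts):
        if not host.capacity_disks:
            raise ValueError(f"{host.name}: need at least one capacity disk")

        # Tag the VMkernel interface for VSAN traffic.
        cmds = [f"esxcli vsan network ipv4 add -i {host.vmk}"]

        if index == 0:
            # First host: create the cluster, then read back its UUID.
            cmds.append("esxcli vsan cluster new")
            cmds.append("esxcli vsan cluster get")
        else:
            cmds.append(f"esxcli vsan cluster join -u {cluster_uuid}")

        # Claim one cache SSD plus its capacity disks as a disk group.
        disk_args = " ".join(f"-d {disk}" for disk in host.capacity_disks)
        cmds.append(f"esxcli vsan storage add -s {host.cache_disk} {disk_args}")

        plan[host.name] = cmds
    return plan


if __name__ == "__main__":
    # All names and device IDs below are made up for illustration.
    demo_hosts = [
        Host("esx01", "vmk1", "naa.ssd01", ["naa.hdd01", "naa.hdd02"]),
        Host("esx02", "vmk1", "naa.ssd02", ["naa.hdd03", "naa.hdd04"]),
        Host("esx03", "vmk1", "naa.ssd03", ["naa.hdd05", "naa.hdd06"]),
    ]
    for host_name, commands in bootstrap_plan(demo_hosts).items():
        print(f"# on {host_name}")
        for command in commands:
            print(command)
```

The point of generating the plan instead of running it is that you can eyeball the device IDs before anything gets claimed. Once the datastore shows up on every host, vCenter can be deployed onto it like any other VM.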

Next Up

We are doing our black box stress testing later this week to marry VSAN 2.1 performance with business KPIs.  In this case, the KPIs will likely be things such as simultaneous virtual desktop sessions per VSAN cluster of a given config.  I’ll post some of this when I collate the results.
