VMware NSX

Its VMworld show time, which means awesome stuff being announced. VMware’s Overlay Network and SDN Sagas are clearly just starting, and so VMware announced today its Network Virtualization Solution: NSX.

I haven’t been able to grasp any VMware own technical material yet, however from what I could until so far understand we are talking about the first true signs of a Nicira’s integration/incorporation into current Vmware Networking Portfolio. VMware’s Overlay Networking technology – VXLAN –  has been integrated with Nicira’s NVP platform. However, this is not mandatory, so you do get 2 additional flavors of encapsulation to choose among VXLAN: GRE and STT.

As it can be seen from NSX datasheet, key features are similar to Nicira’s NVP plataform:

• Logical Switching – Reproduce the complete L2 and L3
switching functionality in a virtual environment, decoupled
from underlying hardware
• NSX Gateway – L2 gateway for seamless connection to
physical workloads and legacy VLANs
• Logical Routing – Routing between logical switches,
providing dynamic routing within different virtual networks.
• Logical Firewall – Distributed firewall, kernel enabled line
rate performance, virtualization and identity aware, with
activity monitoring
• Logical Load Balancer – Full featured load balancer with
SSL termination.
• Logical VPN – Site-to-Site & Remote Access VPN in software
• NSX API – RESTful API for integration into any cloud
management platform”

Ivan Pepelnjak gives very interesting preview-insight on what’s under the hood. Though new, integration with Networking partners is also being leveraged. Examples are: Arista,  Brocade, Cumulus, Dell, and Juniper on the pure Networking side, and Palo Alto Networks on the security side.

Exciting time for the Networking community!


I found this Packet-Pusher’s Podcast from last week about Avaya’s Software Defined Datacenter & Fabric Connect with Paul Unbehagen really interesting. He points out some of the differences in VMware’s Overlay Network VXLAN approach against Physical Routing Switches Overlay Network’s SPB approach (which naturally Avaya is using). Some of these were different encapsulation methods – where with VXLAN the number of headers is quite more numerous – ability to support Multicasting environments (such as PIM), and most importantly, raises the central question: where do you want to control your routing and switching – the Virtual Layer or the Physical Layer. Even though I’m theoretically favorable to Virtual, arguments to keep some functions on the Physical layer still do make a lot of sense in a lot of scenarios.

The Podcast also features a lot of interesting Avaya Automation related features result of a healthy promiscuous relationship between VMware and OpenStack. Also if you want to get in more detail about SPB, Paul Unbehagen covers lots of tech details in his blog.

Overlay Virtual Networks – VXLAN

Overlay Virtual Networks (OVN) are increasingly gaining a lot of attention, whether from Virtual Networking providers, as for Physical Networking providers. Here are my notes on a specific VMware Solution: Virtual eXtensible LAN (VXLAN).


This would never be an issue if not large/huge datacenters didn’t come into place. We’re talking about some big-ass companies as well as Cloud providers, enterprises where the number of VMs scales beyond thousands. This is when the ability to scale within the Network, to rapidly change the network, and the ability to isolate different tenants is crucial. So the main motivators are:

  • First L2 communication requirement is an intransigent pusher, which drags 4k 802.1Q VLAN-tagging limitation plus L2 flooding with it
  • Ability to change configurations without burining the rest of the network, and doing it quickly – e.g. easy isolation
  • Ability to change without being “physically” constrained by Hardware limitations
  • Ability to scale large number of VMs, and being able to isolate different tenants.
  • Unlimmited Workload mobility.

These requirements demand for a change in the architecture. They demand that one is not bound to physical hardware constraints, and as such, demand an abstraction layer run by Software and which can be mutable – a virtualization layer in other words.

Network Virtualization

Professor Nick Feamster – who ended today his SDN MOOC course on Coursera – goes further and describes Network Virtualization as being the “killer app” for SDN. As a side note, here is an interesting comment from a student of this course.

Thus it is no surprise that Hyper-visor vendors were the first to push such technologies.

It is also no wonder that their approach was to treat the physical Network as a dumb Network, unaware of the Virtual Segmentation that is done within the Hyper-visor.

To conclusion, the main goal being really moving away from the dumb VLAN-aware L2 vSwitch to building a Smart Edge (Internal VM Host Networking) without having to rely on smart Datacenter Fabrics (supporting for instance EVB, etc).

Overall solutions

There is more than vendor using a OVN approach to solve the stated problems. VMware was probably one of the first Hyper-visor vendors who started with their vCloud Director Networking Infrastructure (vCDNI). MAC in MAC solution, so L2 Networks over L2. Unfortunately this wasn’t a successful attempt, and so VMware changed quickly the its solution landscape. VMware currently has two Network Virtualization solutions, namely VXLAN and more advanced Nicira NVP. Though I present these two as OVN solutions, this is actually quit an abuse, as these are quit different from each other. In this post I will restrict myself to VXLAN.

As for Microsoft, shortly after VXLAN was introduced Microsoft proposed its own Network Virtualization Solution called NVGRE. Finally Amazon uses L3 Core, which uses IP-over-IP communications.


Virtual eXtensible LAN (VXLAN) was developed by in conjunction of Cisco and VMware, and IETF launched a draft. It is supposed to have a similar encapsulation Header as in Nexus 7k OTV/LISP, allowing for Nexus 7k to act as a VXLAN Gateway.

VXLAN introduces an additional kernel Software layer between ESX vSwitches and Physical Network Card, which can either be VMware Distributed vSwitch or Cisco’s Nexus 1000v. This kernel code is able to introduce additional L2 Virtual Segments beyond the 4k 802.1Q limitation over standard IP Networks. Note that these segments run solely within the hyper-visor, which means that in order to have a physical server communicating with these VMs you will need a VXLAN Gateway.

So the VXLAN kernel is aware of Port-Groups on VM-side and intriduces a VX-segment ID (VNI), and introduces an adaptor on the NIC-side for IP communications- the VXLAN Termination Point (VTEP). VTEP has an IP address and performs encapsulation/decapsulation from L2 traffic generated by a VM and inserts VXLAN header and an UDP header plus traditional IP envelop to talk to the physical NIC. The receiving host where the destination VM resides will do the exact same reverse process.

Also note that it transforms broadcast traffic into multicast traffic for segmentation.

It is thus a transparent layer between VMs and the Network. However, since there is no centralized control plane, VXLAN used to need IP multicast on the DC core for L2 flooding. However this has changed with recent enhancements on Nexus OS.

Here’s VMware’s VXLAN Deployment Guide, and Design Guide.

Finally please do note that not everyone is pleased with VXLAN solution.


Software Defined Storage – marketing boloney or technical Breakthrough?

If you’re as suspicious as Wikipedia and I are about the new marketing buzzwords that have transcended the Networking world into Storage terminology, then this might be a post for you.

OK, so I tried a sort of reverse-engineering approach when investigating about Software Defined Storage (SDS). I tried to figure out how SDN would materialize in a Storage world, and only then did I checked what vendors are saying.

Here it goes. SDN’s architecture decouples the operational control plane from a distributed architecture where each Networking box holds its own, and centralizes it in a single device (for the sake of simplicity, I will not consider HA concerns, nor scalablity details, are those are specifics of a solution, not a model), called the SDN Controller. The goal being to make it easier in terms of Northbound interface to have customized coding, whether from an Administrator or from the application’s provider, and instantly change Networking’s behavior. Thus allowing for swift changes to take place in Networking, and populating new forwarding rules “on the fly“.

Now the way I would like to have sort of the same things map into the Storage world would be something arround the following basic characteristics:

  1. Having a centralized Control Plane (either consisting of a single controller or several), which has an Northbound API against which I can run my own scripts to customize Storage configurations and behavior. The control is not comprised by a data-plane – that stays in Storage Arrays.
  2. Applications being able to request customized Service Levels to the Control Plane, and being able to change those dinamically.
  3. Automatic orchestration and Provisioning of Storage
  4. Ability to react fast to storage changes, such as failures

Now when you talk about Networking devices, one of the advantages of decoupling Control Plane from all switchs in the Network is to have stupid or thin Switchs – and consequently cheaper ones. These minimalistic (dumb) switches would simply support populating their FIB table (whether using OpenFlow or another Protocol) by their Controller, and only a few more basic protocols related to link layer control and negotiation.

However when you try to the same with the Storage Arrays, the concept gets a little more complicated. You need to worry about data redundancy (not just the box redundancy for service), as well as performance. So the only way you can treat Storage Arrays as stupid devices is to add another layer between Arrays and Hosts, where you centralize IO – in other words, a Virtualization Layer. Otherwise, your SDS Controller would just be an orchestration layer for configuration, and we’ve already got a buzzword for that: Cloud.

By having a Virtualization layer in between you can now start mirroring data across different Arrays, locally or in a DR perspective, thus being able to control data redundancy outside your array. You also start having better control of your Storage Service level, being able to stripe a LUN accross different Tiers of Storage (SSD, 15k SAS, 10k SAS, 7,2k NL SAS) in different Arrays, transparently to the host. Please keep in mind that this is all theoratical babel so far; I’m not saying this should be implemented in production at real life scenarios. I’m justing wondering arround the concept.

So, besides having a centralized control plain, another necessity prompts: you need a virtualization layer in between your Storage Arrays and Hosts. You might (and correctly) be thinking: we already have that among various vendors, so the next question being: are we there yet? Meaning is this already an astonishing breakthrough? The answer must be no. This is the same vision of a Federated Storage environment which isn’t new at all. Take Veritas Volume Manager, or VMware VMFS.

Wikipedia states that SDS could ” include any or all of the following non-compulsory features:

  • automation with policy-driven storage provisioning – with SLAs replacing technology details
  • virtual volumes – allowing a more transparent mapping between large volumes and the VM disk images within them, to allow better performance and data management optimizations
  • commodity hardware with storage logic abstracted into a software layer
  • programability – management interfaces that span traditional storage array products, as a particular definition of separating “control plane” from “data plane”
  • abstraction of the logical storage services and capabilities from the underlying physical storage systems, including techniques such as in-band storage virtualization
  • scale-out architecture “

VMware had already pitched its Software Defined Datacenter vision in VMworld 2012, having bought Startups that help sustaining such marketing claims, such as Virsto for SDS, and Nicira for SDN.

But Hardware Vendors are also embracing the Marketing hype. NetApp announced SDS, with Data ONTAP Edge and Clustered Data ONTAP. The way I view it, both solutions consist on using a virtualization layer with common OS. One by using a simple VSA with NetApp’s WAFL OS, that presents Storage back to VMs and Servers.


The other by using a Gateway (V-Series) to virtualize third-party Arrays. This is simply virtualization, still quite faraway a truly SDS concept.

IBM announcing the same, with a VSA.

HP is also leveraging its LeftHand VSA for Block-Storage, as well as a new VSA announced for Backup to Disk – StoreOnce VM. Again, same drill.

Now EMC looks to me (in terms of marketing at least) as the Storage Player who got the concept best. It was announced that EMC will launch soon its Software Defined Storage controller – ViPR. Here is its “Datasheet“. 

To conclusion: in my oppinion SDS is still far far far away (technically speaking) from the SDN developments, so as usual, renew your ACLs for this new marketing hype.

VMware SDS (?) – Virsto VSA Architecture

Though still having limited technical resources available for an indepth deep dive, the available resources already provide an overall picture.

Virsto positions itself as a Software Defined Storage (SDS) product, using Storage virtualization tecnology. Well, besides being 100% Software, I see a huge gap from the Networking Software-Defined concept to the SDS side. I do imagine VMware’s marketing pushing them to step up the SDS marketing, with their Software Defined Datacenter vision. However Virsto failed to make me disagree on Wikipedia’s SDS definition.

Living the SDS story aside, there’s still a story to tell. Not surprisingly, their main selling point is not different from the usual Storage vendors: performance. They constructed their technical marketing focused on solving what they call the “VM I/O blender” issue (well choosen term, have to recognize that). The VM I/O blender effect derivates from having of several VMs in concurrency disputing IOPS from the same Array, thus creating a large randomized pattern. (By the way: this randomized pattern is one of the reasons why you should always lookout for Storage Vendors claiming large Cache-Hit percentages, and always double-check the theoretical IO capacity on the disk side.)

VM IO Blender Virsto

How do you Architect Virsto’s Solution

Virsto uses a distributed Clustered Virtual Storage Appliance (VSA), with a parallel computing architecture (up to 32 nodes on the same VMware cluster latest from what I could check). You install a Virtual Storage Appliance (VSA) on each host, (and from what I could understand) serving as a target for you physical storage Arrays. Virsto VSA then presents storage back to your VMs just like NFS datastores, allowing it to control all VM IO while supporting HA, FT, VM and storage vMotion, DRS. Here is an overview of the architecture.

Virsto VMware Architecture

As usual, one of the VSA’s serves as a Master, and the other VSA on different nodes (i.e. VMware Hosts) as slaves. Each VSA has its own Log file (the vLog file), that serves as the central piece in the whole architecture, as we will see next. They support heterogeneous Block Storage, so yes, you are able to virtualize different Vendor Arrays with it (although in practice that might not be the best practical solution at all).

Virsto Architecture

You can use different Arrays from different providers, and tier those volumes up to four (4) storage tiers (such as SSD, 15k rpm, 10k rpm, and 7.2k rpm). After aggregation on the VSA, Virsto vDisks can be presented in many formats, namely: iSCSI LUN, vmdk, VHD, vVol, or KVM file. So yes, Virsto VSA also works on top of Hyper-V the same way. Here is an interesting Hyper-V demo.

So to solve the VM I/O blender effect, every Write IO from each VM is captured and logged into a local log file (Virsto vLog). Virsto aknowledges the write back to the VM immediately after the commit on the log file, and then asynchronously writes in a sequential manner back to your Storage Array. Having data written in a sequential manner has the advantage of having blocks organized in a contiguous fashion, thus enhancing performance on reads. Their claim is a better 20-30% read performance.

So as a performance sizing best practice, they do recommend using SSD storage for the vLog. Not mandatory, but serious performance booster. Yes please.

The claim is that through the logging architecture they are able to do both thin provisioning and Snapshots without any performance degredation.

As a result, they compare Virsto vmdks to Thick vmdks in terms of performance, and to linked-clones in terms of space efficiency. If you check Virsto’s demo on their site, you’ll see that they claim having better performance than thick eagor-zeroed vmdks, even with Thin Provisioned Virsto-vmdk disks. Note that all Virsto-vmdk are thin.

Finally how do they guarantee HA of the vLog? Well from what I could understand, one major difference from VMware’s own VSA is that Virsto’s VSA will simply leverage already existing shared storage from your SAN. So it will not create the shared storage environment, from what I understood it stands on Shared Storage. When I say it, please take only into consideration the vLog. I see no requirements to have the VSA as well on top of Shared Storage.

Data consistency is achieved by redoing the log file of the VSA that was on the failing Host. VMware’s VSA on the contrary, allows you to aggregate local non-shared disks of each of your VMware Hosts, and present them back via NFS to your VMs or Physical Hosts. It does this while still providing HA (of of the main purposes) by coping data across Clustered nodes in a “Network RAID”, providing continuous operations even on Host failure.

Some doubts that I was not able to clarify:

  • What are the minimum vHardware recommendations for each VSA?
  • What is the expected Virsto VSA’s performance hit on the Host?
  • What is the limit maximum recommended number of VSAs clustered together?

VMware’s VSA vs Virsto VSA

So as side note, I do not think it is fair to claim to near death of VMware’s own VSA, even if Virsto is indeed able to sustain all its technical claims. Virsto has a different architecture, and can only be positioned in more complexed IT environments  where you already have different Array technologies and struggle for performance vs cost.

VMware’s VSA is positioned for SMB customers with limited number of Hosts, and is mainly intended to provide a Shared Storage environment without shared storage. So different stories, different ends.

Setup for VMware, HP Virtual Connect, and HP IRF all together

Trying to setup your ESXi hosts networking with HP Virtual Connect (VC) to work with your HP’s IRF-Clustered ToR switchs in Active-Active mode?


This setup improves performance, as you are able to load balance traffic from several VMs across both NICs, across both Virtual Connect modules, and across both ToR switchs, while in the meanwhile gaining more resilience. Convergence on your network will be much faster, either if a VC uplink fails, a whole VC module fails, or a whole ToR fails.

Here’s a couple of docs that might help you get started. First take a look on the IRF config manual of your corresponding HP’s ToR switches. Just as an example, here’s the config guide for HP’s 58Xo series. Here are some important things you should consider when setting up an IRF cluster:

  1. If you have more than two switches, choose the cluster topology.  HP IRF allows you to setup a daisy-chained cluster, or in ring-topology. Naturally the second one is the recommended for HA reasons.
  2. Choose which ports to assign to the cluster. IRF does not need specific Stacking modules and interfaces to build the cluster. For the majority of HP switchs, the only requirement is to use 10GbE ports, so you can flexibly choose whether to use cheaper CX4 connections, or native SFP+ ports (or modules) with copper Direct Attach Cables or optical transcevers. Your choice. You can assign one or more physical ports into one single IRF port. Note that the maximum ports you can assign to an IRF port varies according to the switch model.
  3. Specify IRF unique Domain for that cluster, and take into consideration that nodes will reboot in the meanwhile during the process, so this should be an offline intervention.
  4. Prevent from Split-brain Condition potential events. IRF functions by electing a single switch in the cluster as the master node (which controls Management and Control plane, while data plain works Active-Active on all nodes for performance). So if by any reason stacking links were to fail, slave nodes by default initiate new master election, which could lead into a split-brain condition. There are several mechanisms to prevent this front happening (called Multi-Active Detection (MAD) protection), such as LACP MAD or BFD MAD.

Next hop is your VC. Whether you have a Flex-Fabric or a Flex-10, in terms of Ethernet configuration everything’s pretty much the same. You setup two logically identical Shared Uplink Sets (SUS) – note that VLAN naming needs to be different though inside the VC. Next you assign uplinks from one VC module to one SUS and uplinks from another different VC module to the other SUS. It is very important that you get this right, as VC modules are not capable of doing IRF withemselves (yet).

If you have doubts on how VC works, well nothing like the VC Cookbook. As with any other documentation that I provide with direct links, I strongly recommend that you google them up just to make sure you use the latest version available online.

If you click on the figure above with the physical diagram you will be redirected to HP’s Virtual Connect and IRF Integration guide (where I got the figure in the first place).

Last hop is your VMware hosts. Though kind of outdated document (as Vmware already launched new ESX releases after 4.0), for configuration purposes the “Deploying a VMware vSphere HA Cluster with HP Virtual Connect FlexFabric” guide already allows you to get started. What you should take into consideration is that you will not be able to use the Distributed vSwitch environment with new vmware’s LACP capabilities, which I’ll cover in just a while. This guide provides indications on how to configure in a standard vSwitch or distributed setup on your ESX boxes (without using VMware’s new dynamic LAG feature).

Note that I am skipping some steps on the actual configuration, such as the Profile configuration on the VC that you need to apply to the ESX hosts. Manuals are always your best friends in such cases, and the provided ones should provide for the majority of steps.

You should make sure that you use redundant Flex-NICs if you want to configure your vSwitchs with redundant uplinks for HA reasons. Flex-NICs are present themselves to the OS (in this case ESXi kernel) as several distinct NICs, when in reallity these are simply Virtual NICs of a single physical port. Usually each half-height HP Blade comes with a single card (the LOM), whcih has two physical ports, each of which can virtualize (up to date) four virtual NICs. So just make sure that you don’t choose randomly which Flex-NIC is used for each vSwitch.

Now comes the tricky part. So you should not that you are configuring Active-Active NICs for one vSwitch.


However, in reality these NICs are not touching the same device. Each of these NICs is mapped statically to a different Virtual Connect module (port 01 to VC module 01, and port 02 to VC module 02). Therefore in a Virtual Connect environment you should note that you cannot choose the load balancing algorithm on the NICs based on IP-hash. Though you can choose any other algorithmn, HP’s recommendation is using Virtual Port ID, as you can see on the guide (this is the little piece of information missing on VC Cookbook on Active-Active config with Vmware).

Finally, so why the heck can’t you use IP-hash, nor VMware’s new LACP capabilities? Well  IP-Hash is intended when using link aggregation mechanisms to a single switch (or Switch cluster like IRF, that behaves and is seen as a single switch). As a matter a fact, IP-SRC-DST is the only algorithm supported for NIC teaming on the ESX box as well for the Switch when using port trunks/EtherChannels:

” […]

  • Supported switch Aggregation algorithm: IP-SRC-DST (short for IP-Source-Destination)
  • Supported Virtual Switch NIC Teaming mode: IP HASH
  • Note: The only load balancing option for vSwitches or vDistributed Switches that can be used with EtherChannel is IP HASH:
    • Do not use beacon probing with IP HASH load balancing.
    • Do not configure standby or unused uplinks with IP HASH load balancing.
    • VMware only supports one EtherChannel per vSwitch or vNetwork Distributed Switch (vDS).

Moreover, VMware explains in another article: ” What does this IP hash configuration has to do in this Link aggregation setup? IP hash algorithm is used to decide which packets get sent over which physical NICs of the link aggregation group (LAG). For example, if you have two NICs in the LAG then the IP hash output of Source/Destination IP and TCP port fields (in the packet) will decide if NIC1 or NIC2 will be used to send the traffic. Thus, we are relying on the variations in the source/destination IP address field of the packets and the hashing algorithm to provide us better distribution of the packets and hence better utilization of the links.

It is clear that if you don’t have any variations in the packet headers you won’t get better distribution. An example of this would be storage access to an nfs server from the virtual machines. On the other hand if you have web servers as virtual machines, you might get better distribution across the LAG. Understanding the workloads or type of traffic in your environment is an important factor to consider while you use link aggregation approach either through static or through LACP feature.

So given the fact that you are connecting your NICs to different VC modules (in a HP c7000 Enclosure), the only way you will be able to have active-active teaming is with static etherchannel, and without IP-Hash.

None-the-less please note that if you were configuring for instance a non-blade server, such as a HP Dl360, and connecting it to an IRF ToR cluster, then in that case you would be able to use vDS with LACP feature, and thus use IP-Hashing. Or if you were using a HP c3000 Enclosure, which is usually positioned for smaller enviroments. So it does not make that much sense to be using a c3000 and in the meanwhile having the required vSphere Enterprise Plus licensing.

That’s it, I suppose. Hopefully this overview should already get your ESX farm up-and-running.


Disclamer: note that these are my own notes. HP is not responsable for any of the content here provided.