Software Defined Storage – marketing baloney or technical breakthrough?

If you’re as suspicious as Wikipedia and I are about the new marketing buzzwords that have crossed over from the Networking world into Storage terminology, then this might be a post for you.

OK, so I tried a sort of reverse-engineering approach when investigating Software Defined Storage (SDS). I tried to figure out how SDN would materialize in a Storage world, and only then did I check what vendors are saying.

Here it goes. SDN’s architecture decouples the operational control plane from the distributed model where each Networking box holds its own, and centralizes it in a single device called the SDN Controller (for the sake of simplicity, I will not consider HA concerns or scalability details, as those are specifics of a solution, not of the model). The goal is to expose a Northbound interface against which customized code can be run, whether by an Administrator or by the application’s provider, to change the Network’s behavior instantly. This allows swift changes to take place in the Network, populating new forwarding rules “on the fly“.

Now the way I would like to see roughly the same model map into the Storage world revolves around the following basic characteristics (a hypothetical sketch follows the list):

  1. Having a centralized Control Plane (consisting of either a single controller or several), which exposes a Northbound API against which I can run my own scripts to customize Storage configuration and behavior. The controller does not comprise the data plane – that stays in the Storage Arrays.
  2. Applications being able to request customized Service Levels from the Control Plane, and to change those dynamically.
  3. Automatic orchestration and Provisioning of Storage
  4. Ability to react fast to storage changes, such as failures
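
Purely as an illustration of points 1 and 2, here is a minimal sketch of what a script against such a Northbound API could look like. Everything in it is hypothetical – the controller endpoint, the resources and the payload fields are assumptions of mine, not any vendor’s API; it only assumes the standard Python requests library.

```python
import requests  # standard HTTP client; the SDS controller API below is hypothetical

CONTROLLER = "https://sds-controller.example.local/api/v1"  # assumed endpoint
HEADERS = {"Authorization": "Bearer <token>"}               # assumed auth scheme

# Ask the (hypothetical) control plane for a volume with a given Service Level,
# leaving placement across arrays and tiers entirely to the controller.
payload = {
    "name": "oltp-db-01",
    "size_gb": 500,
    "service_level": {"max_latency_ms": 5, "min_iops": 5000, "redundancy": "dual-site"},
}

resp = requests.post(f"{CONTROLLER}/volumes", json=payload, headers=HEADERS, timeout=30)
resp.raise_for_status()
print("Volume provisioned:", resp.json())

# Point 2: the application could later change the Service Level dynamically.
resp = requests.patch(
    f"{CONTROLLER}/volumes/oltp-db-01",
    json={"service_level": {"min_iops": 12000}},
    headers=HEADERS,
    timeout=30,
)
resp.raise_for_status()
```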

Now when you talk about Networking devices, one of the advantages of decoupling the Control Plane from all switches in the Network is to have stupid or thin switches – and consequently cheaper ones. These minimalistic (dumb) switches would simply allow their FIB table to be populated by the Controller (whether using OpenFlow or another protocol), plus only a few basic protocols related to link-layer control and negotiation.

However, when you try to do the same with Storage Arrays, the concept gets a little more complicated. You need to worry about data redundancy (not just box redundancy for service availability), as well as performance. So the only way you can treat Storage Arrays as stupid devices is to add another layer between Arrays and Hosts, where you centralize IO – in other words, a Virtualization Layer. Otherwise, your SDS Controller would just be an orchestration layer for configuration, and we’ve already got a buzzword for that: Cloud.

By having a Virtualization layer in between, you can now start mirroring data across different Arrays, locally or from a DR perspective, thus being able to control data redundancy outside your array. You also start having better control of your Storage Service Level, being able to stripe a LUN across different Tiers of Storage (SSD, 15k SAS, 10k SAS, 7.2k NL SAS) in different Arrays, transparently to the host. Please keep in mind that this is all theoretical babble so far; I’m not saying this should be implemented in production in real-life scenarios. I’m just wandering around the concept.

So, besides having a centralized control plane, another necessity emerges: you need a virtualization layer between your Storage Arrays and Hosts. You might (and correctly) be thinking: we already have that from various vendors, so the next question is: are we there yet? Meaning, is this already an astonishing breakthrough? The answer must be no. This is the same vision as a Federated Storage environment, which isn’t new at all. Take Veritas Volume Manager, or VMware VMFS.

Wikipedia states that SDS could “include any or all of the following non-compulsory features:

  • automation with policy-driven storage provisioning – with SLAs replacing technology details
  • virtual volumes – allowing a more transparent mapping between large volumes and the VM disk images within them, to allow better performance and data management optimizations
  • commodity hardware with storage logic abstracted into a software layer
  • programability – management interfaces that span traditional storage array products, as a particular definition of separating “control plane” from “data plane”
  • abstraction of the logical storage services and capabilities from the underlying physical storage systems, including techniques such as in-band storage virtualization
  • scale-out architecture “

VMware had already pitched its Software Defined Datacenter vision at VMworld 2012, having bought Startups that help sustain such marketing claims, such as Virsto for SDS and Nicira for SDN.

But Hardware Vendors are also embracing the Marketing hype. NetApp announced SDS, with Data ONTAP Edge and Clustered Data ONTAP. The way I view it, both solutions consist of using a virtualization layer with a common OS. One does so by using a simple VSA running NetApp’s WAFL OS, which presents Storage back to VMs and Servers.


The other does so by using a Gateway (V-Series) to virtualize third-party Arrays. This is simply virtualization, still quite far away from a true SDS concept.

IBM announced the same, with a VSA.

HP is also leveraging its LeftHand VSA for Block-Storage, as well as a new VSA announced for Backup to Disk – StoreOnce VM. Again, same drill.

Now EMC looks to me (in terms of marketing at least) like the Storage Player who got the concept best. It was announced that EMC will soon launch its Software Defined Storage controller – ViPR. Here is its “Datasheet“.

In conclusion: in my opinion SDS is still far, far away (technically speaking) from the SDN developments, so, as usual, renew your ACLs for this new marketing hype.

Basic BGP Concepts

Here is a very short introduction to Border Gateway Protocol (BGP) – or Bloody Good Protocol, as some like to call it. BGP is a routing Protocol, which is used mainly for:

  • Sharing prefixes (networks) between ISPs, thus enabling the Internet to scale;
  • Multi-homing an organization to several ISPs (whereby Internet prefixes from ISPs are learned, and its own networks are advertised)
  • Scaling internally in very large organizations

BGP is an Exterior Gateway Protocol (EGP), which differs from IGPs – such as RIP, OSPF, IS-IS, EIGRP – mainly in the following ways:

  • Uses TCP (port 179) for transport ensuring reliable delivery of BGP messages between peers (Routers)
  • Can scale to hundreds of thousands of Routes (without crashing like IGPs would)
  • Peers are manually configured – there is no automatic peer discovery, all peers must be manually added
  • Besides prefix, mask and metric, BGP carries several additional attributes. Though this is a major advantage over other protocols, attributes also have the disadvantage of making BGP more complex to configure
  • BGP is “political in nature” when it comes to finding best paths, meaning Best Paths can be flexibly changed (using attributes). IGPs, on the contrary, have a fixed best-path algorithm, namely Shortest Path First, and choose paths by metric. It is much harder to manually influence the Best Path chosen in an IGP (for instance you can change the cost of an interface, but it is not possible to set a different cost on the same interface for different destinations), whereas in BGP it is much easier.
  • BGP may converge more slowly when failures occur, whereas IGPs usually converge faster
  • Since BGP is not a link state protocol, BGP does not share every prefix in its BGP table with every peer. Instead, it only shares the best routes with peers (even though it might know several paths to the same destination).

BGP carries several attributes with each prefix. Since there is no space in the routing table to hold all those attributes, BGP has its own table where it stores prefixes with all their attributes. However, the BGP table is not used directly to route IP packets. Instead, BGP places only the best prefixes in the routing table with an administrative distance of 255 (so that if a prefix is learned by both BGP and an IGP, the IGP route will always be preferred), while maintaining all prefixes in its own table. This allows for redundancy as well as load-balancing capabilities.

Attributes are what make BGP so flexible and thus interesting. Most are optional; only the first three are mandatory. Here is a list of BGP’s attributes:

  • Origin (mandatory) – indicates how the prefix was originally injected into BGP; in other words, it indicates whether a certain prefix was imported from another Routing protocol or from static routes, whether it was manually originated by the administrator, or even whether it was originated by EGP (obsolete)
  • AS Path (mandatory) – allows eBGP to be loop free. Subsequent ASes can see how the route travelled, by being able to see an ordered list of AS numbers between the local AS (first AS Path entry) and the AS originating the destination prefix (last AS Path entry).
  • Next hop (mandatory) – is the IP address that should be used for packets destined to a certain prefix. It allows a peer to deduce the interface to use to send packets to the appropriate border router.
  • Multi-Exit Discriminator (MED) (optional) – used to influence inbound traffic from a neighboring AS. It only influences direct neighbor peers. It is a low-power attribute (it comes late in the decision process), but it can be useful in organizations that are multi-homed to the same ISP. The lowest MED value wins
  • Local preference (optional) – used to influence outbound traffic, also in organizations multi-homed to the same ISP. The Local preference value is only advertised between iBGP peers, and is not advertised to a neighboring AS. The prefix with the highest local preference value wins the decision process, with the advantage of being able to load balance traffic while maintaining redundancy.
  • Atomic Aggregate (optional) – Used in prefix summarization to warn throughout the Internet that a certain prefix is an aggregate
  • Aggregator (optional) – also used in prefix summarization and shared throughout the Internet; it includes the router-id and AS number of the router that performed the summarization
  • AS 4 Path (optional) – used to carry the longer 32-bit AS numbers across ASes that support only 16-bit AS numbers
  • Communities (optional) – a special marking for policies, usually deployed by ISPs. It allows grouping prefixes together in order to give special, common treatment to a set of prefixes
  • Extended communities  (optional) – as the name indicates, it extends the Communities attribute length, allowing for additional provider offerings such as MPLS VPNs, etc.
  • Originator ID (optional) – attribute intended for iBGP environments where Route Reflectors (RR) are used. It protects against RR misconfiguration by having a client ignore duplicate prefixes that it advertised itself and received back.
  • Cluster List (optional) – helps prevent loops when using multiple clusters of Route Reflectors (in redundant HA mode). The Cluster List operates much like the AS Path does, collecting the sequence of Cluster IDs through which the update has traversed. This attribute is also exclusive to iBGP environments, and will not traverse to eBGP peers

Finally, the BGP decision process hierarchy, from highest to lowest. BGP will choose the best path considering the many attributes associated with the multiple copies of one prefix, instead of a cost or metric like IGPs do. Since the attributes can be changed by the administrator, the best-path selection is indeed based on the preferences of the administrator. BGP will also maintain several paths in its table, so that when a prefix is no longer available (for example due to a link failure, which BGP detects through the keep-alive messages of its TCP session) a new best path is populated into the routing table.

Whenever there is a tie, move to the next lower level in order to choose the best path that will populate the routing table (a sketch of this ordering follows the list):

  • Next hop reachable – a route must exist to the next-hop IP address; the path will not be considered if the next hop is not reachable
  • Preferred Value – the highest preferred value will be chosen. It is a proprietary parameter and local to the router.
  • Local preference – the highest local preference value will be chosen. The policy is local to the AS
  • Locally originated – prefix originated by the local router
  • Shortest AS Path – the fewest AS hops between the local AS and the destination.
  • Origin – “i” preferred over “?”
  • Multi-Exit Discriminator – influences neighboring AS only
  • External BGP versus internal BGP – eBGP preferred over iBGP
  • Router-ID – the lowest value will be chosen. It is the final tiebreaker
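
To make the ordering concrete, here is a minimal sketch of that tie-break hierarchy in Python. The field names are my own, the vendor-specific Preferred Value is included only as an illustration, and a real BGP implementation evaluates more conditions than this.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Path:
    # One candidate copy of a prefix; field names are illustrative only.
    next_hop_reachable: bool
    preferred_value: int          # vendor-specific, local to the router
    local_preference: int
    locally_originated: bool
    as_path: List[int]
    origin: str                   # "i" (IGP), "e" (EGP, obsolete) or "?" (incomplete)
    med: int
    is_ebgp: bool
    router_id: str

ORIGIN_RANK = {"i": 0, "e": 1, "?": 2}   # lower is better: "i" preferred over "?"

def best_path(candidates: List[Path]) -> Path:
    # Drop paths whose next hop is unreachable, then apply the tie-break order.
    usable = [p for p in candidates if p.next_hop_reachable]
    return min(
        usable,
        key=lambda p: (
            -p.preferred_value,        # highest Preferred Value wins
            -p.local_preference,       # highest Local preference wins
            not p.locally_originated,  # locally originated wins
            len(p.as_path),            # shortest AS Path wins
            ORIGIN_RANK[p.origin],     # best Origin wins
            p.med,                     # lowest MED wins
            not p.is_ebgp,             # eBGP preferred over iBGP
            p.router_id,               # lowest Router-ID (compared as a string here) is the final tiebreaker
        ),
    )
```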

How to do rough Storage Array IOPS estimates

This post is dedicated to trying to get around the mystery factor that Cache and Storage Array algorithms have, and to helping you calculate how many disks you should have inside your Storage Array to sustain a stable average number of disk operations per second – IOPS.

Face it, it is very hard to estimate performance – more specifically IOPS. Though throughput may be high in sequential patterns, Storage Arrays face a different challenge when it comes to random IOPS. And from my personal experience, Array-vendor Sales people tend to be over-optimistic when it comes to the maximum IOPS their Array can produce. And even though their Array might actually be able to achieve a certain high maximum value of IOPS with 8KB blocks, that does not mean you will achieve it in your environment.

Why IOPS matter

A lot of factors can affect your Storage Array performance. The first typical factor is the very random traffic and high output patterns of databases. It is no wonder this is the usual first use case for SSD. Online Transaction Processing (OLTP) workloads, which double IOPS by having verified writes (write and read back the data), and which have high speed demands, can be a source of stress for Arrays.

Server Virtualization is also a big contender, producing the “IO blender effect“. Finally, Exchange is also a mainstream contender for high IOPS, though its architecture since Microsoft’s 2010 version changed the paradigm for storing data in Arrays.

These are just some simple and common examples of the many cases where IOPS can be even more critical than throughput. This is where your disk count becomes a critical factor, coming to the rescue when that terabyte of Storage Array cache is exhausted and desperately crying out for help.


So here is some very simplistic, grocery-style math, which can be very useful to quickly estimate how many disks you need in that new EMC/NetApp/Hitachi/HP/… Array.

First of all, IOPS vary according to the disk technology you use. So in terms of Back-end, these are the average numbers I consider:

  • SSD – 2500 IOPS
  • 15k HDD – 200 IOPS
  • 10k HDD – 150 IOPS
  • 7.2k HDD – 75 IOPS

Total Front-End IOPS = C + B , where:

C stands for the total number of read IOPS served by successful Cache hits, and B for the total IOPS you can extract from your disk Back-end (reads + writes). Their formulas are:

C = Total Front-End IOPS * %Read-pattern * %Cache-hit

B = (Theoretical Raw Sum of Back-end Disk IOPS) * %Read-pattern + (Theoretical Raw Sum of Back-end Disk IOPS)/(RAID-factor) * %Write-pattern

C is the big exclamation mark on every array. It depends essentially on the amount of Cache the Array has, on the efficiency of its algorithms and code, and in some cases, such as the EMC VNX, on the usage of helper technologies such as FAST Cache. This is where your biggest margin of error lies. I personally use values between 10% and 50% cache-hit efficiency, which is quite a big range, I know.

As for B, you have to take into consideration the penalty that RAID introduces:

  • RAID 10 has a 2 IO back-end penalty: for every front-end write operation you will have one additional write for the data copy. Thus you have to halve the Back-end write IOPS in order to get the true Front-End write IOPS
  • RAID 5 has a 4 IO back-end penalty: for every write operation, you have 2 reads (read old data + parity) plus 2 writes (new data and parity)
  • RAID 6 has a 6 IO Back-end penalty: for every write operation, you have 3 reads (read old data + parity) plus 3 writes (new data and parity)

Say I was offered a mid-range array with two Controllers, and I want to get about 20.000 IOPS out of 15k SAS HDDs. How many disks would I need?

First the assumptions:

  • About 30% of average cache-hit success on reads (which means 70% of reads will go Back-end)
  • Using RAID 5
  • Using 15k HDD, so about 200 IOPS per disk
  • 60/40 % of Read/Write pattern

Out of these 20.000 total Front-End IOPS, the Cache-hit portion will be:

C = 20.000 * %Read * %Cache-hit = 20.000 * 0,6 * 0,3 = 3.600

Theoretical Raw Sum of Back-end Disk IOPS = N * 200

Thus, to arrive at the total number of disks needed:

20.000 – 3.600 = (N*200)*0,6 + (N*200/4) *0,4

Thus N = 117.14 Disks.

So about 118 disks.
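
For convenience, here is the same grocery-style math wrapped in a small script. It simply encodes the formulas above; the per-disk IOPS figures, cache-hit ratio and RAID factors are the same rough assumptions as in the example, so treat the output as an estimate, not a sizing guarantee.

```python
import math

# Rough per-disk back-end IOPS (same averages as listed above)
DISK_IOPS = {"ssd": 2500, "15k": 200, "10k": 150, "7.2k": 75}

# Back-end write penalty per RAID level
RAID_FACTOR = {"raid10": 2, "raid5": 4, "raid6": 6}

def disks_needed(target_front_end_iops, disk_type="15k", raid="raid5",
                 read_ratio=0.6, cache_hit=0.3):
    """Estimate the number of disks needed to sustain a front-end IOPS target."""
    per_disk = DISK_IOPS[disk_type]
    penalty = RAID_FACTOR[raid]
    write_ratio = 1.0 - read_ratio

    # Reads served from cache never reach the back-end
    cache_hits = target_front_end_iops * read_ratio * cache_hit
    backend_iops = target_front_end_iops - cache_hits

    # Each disk contributes full IOPS to reads, but only 1/penalty to writes
    # (applying the post's 60/40 split directly, as in the worked example)
    iops_per_disk = per_disk * read_ratio + (per_disk / penalty) * write_ratio
    return math.ceil(backend_iops / iops_per_disk)

print(disks_needed(20000))  # -> 118, matching the worked example above
```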


Hope this helped.

iBGP basics

BGP peers that belong to the same Autonomous System (AS) are considered iBGP peers. Why is this important? Because iBGP behavior is different from eBGP, even though the commands might be quite similar. Here is a summary of the differences:

  • The AS-Path is only prepended at the eBGP border, not in iBGP. iBGP thus has its own loop prevention mechanism, which consists of prohibiting a router from advertising prefixes learned from one iBGP peer to other iBGP peers
  • eBGP sets the TTL in its messages’ IP packets to one (1), so that the session is restricted to one hop. In iBGP the TTL is set to the maximum value of 255, as iBGP peers may be multiple hops away
  • BGP attributes are not changed within iBGP communications. Next-hop remains the eBGP next-hop. Moreover, Local preference attribute will only remain within iBGP peers and will not traverse to neighboring AS.
  • Route selection process will prefer an eBGP route over iBGP route when AS-Path is the same.

Since best practices recommend using loopback interfaces for establishing iBGP peering sessions (for redundancy reasons), remember to advertise each peer’s loopback interface into the IGP.

It is the IGP’s responsibility to provide a path to the peer’s loopback interface; if the IGP process fails, iBGP will also fail. So the first troubleshooting task should be confirming that both routers can reach each other’s loopback interfaces.

On the other hand, the loop prevention mechanism also differs between eBGP and iBGP. eBGP uses the AS Path attribute to guarantee loop-free behavior, whereas iBGP relies on almost sacred rules:

  • Prefixes received from an eBGP peer will always be advertised to all other BGP peers (both eBGP and iBGP)
  • Prefixes received from iBGP peers are only sent to eBGP peers.

So by not re-advertising prefixes learned from one iBGP peer to other iBGP peers, iBGP is able to prevent loops. The problem is that this mechanism requires a full-mesh topology between iBGP peers. If not in a full mesh, an iBGP router two hops away may not receive certain eBGP routes, breaking reachability.

However, since the full-mesh requirement makes it quite hard to scale, another mechanism was created which explicitly breaks the rules stated above: Route Reflectors (RR). An RR allows one iBGP Router to reflect prefixes learned from one iBGP peer to another iBGP peer (a client). So the RR client does not need to be fully meshed; it only needs to maintain a session with the RR Router.
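
As a purely illustrative sketch, the advertisement rules (including the RR exception) can be condensed into a small function. The role names are my own, and this is a simplification of the real client/non-client reflection rules, originator-ID and cluster-list checks described below.

```python
def should_advertise(learned_from: str, to_peer: str, local_is_rr: bool = False,
                     to_peer_is_rr_client: bool = False) -> bool:
    """Decide whether a prefix learned from one peer may be sent to another.

    learned_from / to_peer take the values "ebgp" or "ibgp".
    """
    if learned_from == "ebgp":
        # Rule 1: eBGP-learned prefixes go to all other BGP peers
        return True
    # learned_from == "ibgp"
    if to_peer == "ebgp":
        # Rule 2: iBGP-learned prefixes are only sent to eBGP peers...
        return True
    # ...unless this router is a Route Reflector reflecting towards a client
    return local_is_rr and to_peer_is_rr_client

# Classic iBGP: a prefix learned from one iBGP peer is NOT passed to another iBGP peer
assert should_advertise("ibgp", "ibgp") is False
# With a Route Reflector, the same prefix IS reflected to its clients
assert should_advertise("ibgp", "ibgp", local_is_rr=True, to_peer_is_rr_client=True) is True
```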

RR routers reflect prefixes to their iBGP RR clients with BGP attributes unchanged. Note that you can configure a 1:N relationship between an RR router and several RR clients. Also noteworthy in terms of scalability is the fact that an RR client can itself be an RR router for other RR clients, for the same routes. However, the more you cascade, the bigger the risk, since RR Routers represent a Single Point of Failure (SPOF). It is always a best practice to configure redundant Route Reflectors acting as a cluster (so that updates are not duplicated by the reflectors).

Moreover, you can have a hybrid iBGP AS, where you use Route Reflectors for some non-fully-meshed iBGP routers, and keep another set of fully meshed routers that follow the traditional iBGP rules.

Finally, the Originator ID and Cluster List attributes help prevent misconfigurations when using RR.

EMC acquires ScaleIO

EMC acquired the Storage Startup ScaleIO for $200M-$300M.

ScaleIO is a Palo Alto based Startup that competes with Amazon AWS, more specifically with its Elastic Block Storage (EBS). It uses a grid-computing architecture, where each computing node has local disks and the ScaleIO Software. The Software creates a Virtual SAN out of those local disks, thus providing a highly parallel SAN built from computing/storage nodes, while maintaining Enterprise HA requirements.

The ScaleIO Software is allegedly a lightweight piece of Software, and runs alongside other applications, such as DBs and hypervisors. It works with all leading Linux distributions and hypervisors, and offers additional features such as encryption at rest and quality of service (QoS) for performance.

Here’s ScaleIO’s own competitive smackdown:

ScaleIO vs Amazon

AWS: Steps to get a free VM up & running

This post is intended to be a very simplistic guide on how to get started with the Amazon Cloud (AWS) with a VM. It will allow you to start experimenting with AWS without any costs (if you take the right steps). Naturally, Amazon itself provides more detailed steps here.

What you will need:

  • email account
  • cellphone – for security reasons, to make sure you’re not providing CPU power for DDoS attacks and other such stuff
  • credit card – though you actually need to input credit card details, you can indeed run a free VM for a limited amount of time. Amazon will simply not charge you anything for it, as long as you stick to the green zone.

Overview of the process:

  1. Sign up for an AWS account.
  2. Launch a “t1.micro instance”
  3. Beware of the 750 hours free-use limit.

Simple, right? Here are more detailed steps:

  1. Go to and click on “Sign up”.
  2. Next enter your email address and select “I am a new user.”
  3. Enter Login credentials
  4. Enter your contact information
  5. Enter your payment information
  6. Next you will have an Identity verification through cellphone, which consists of being contacted by an automated system that prompts you to enter the PIN number they provided you.
  7. After that, it may take a while until your account is actually activated.
  8. Next go to and click on “My account Console”, and then “AWS Management Console”. After you successfully log in, you’ll land on the console home page (AWS portfolio).
  9. Click on “EC2  Virtual Servers in the Cloud”.
  10. You will land on the EC2 Dashboard. Right in the middle click on “Launch Instance”.
  11. On the “Create a New Instance” menu select “Quick Launch Wizard”
  12. Then select “Create new”, enter the name of your new Instance (i.e. VM) and select, for instance, an Ubuntu image, to make sure you stay on the free tier. Nowadays you have two versions of Ubuntu Server available: 12.04.2 LTS and 13.04. If you are following specific tutorials written for the previous version of Ubuntu, then you might prefer to choose that one. Before hitting continue, make sure you click on the “Download” button. You will download a “pem” file (Privacy Enhanced Mail), which is the certificate you will need to successfully establish a console session to the VM you’re about to launch. After downloading that file, hit “Continue”.
  13. On the next screen you will be able to confirm that the type of instance you are launching is a “t1.micro”. It is very important you do not change this, if you want to stick with the free experimenting. Hit launch.
  14. Your VM might take a few minutes until the orchestration on the Amazon backend is completed. When the initialization is completed, right-click on your instance and select “Connect”.
  15. Select “Connect with a standalone SSH Client”. You will have instructions provided by Amazon there. Make sure you copy the command line provided by Amazon, and launch a terminal session. In Windows I suggest you use Putty. If you’re using a Mac, just go to “Applications > Utilities > Terminal”.
  16. On Mac: Now this part might be a little more tricky if you’re not used to the CLI. You need to change the permissions of that “pem” file you downloaded, so change to the directory where you stored the pem file (usually the “Downloads” directory), and change permissions. Enter the following lines, where the text contained in each pair of quotes is a separate line: “cd ~/Downloads” , “chmod 400 name-of-the-pem-file-you-downloaded.pem“.
  17. Still on Mac: Now is the time to paste the line you copied from Amazon. Something like “ssh -i name-of-the-pem-file-you-downloaded.pem”
  18. On Windows: follow this tutorial to connect with Putty.
  19. There! You should have a Welcome page in your terminal console from Ubuntu.
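
If you prefer scripting over the web console, the same launch can be done programmatically. This is a minimal sketch using the boto3 Python SDK; the AMI ID and key-pair name are placeholders you would replace with your own, and the same free-tier caveats apply.

```python
import boto3  # AWS SDK for Python

ec2 = boto3.resource("ec2", region_name="us-east-1")

# Launch a single micro instance (placeholder AMI ID and key-pair name)
instances = ec2.create_instances(
    ImageId="ami-xxxxxxxx",        # an Ubuntu Server AMI for your region
    InstanceType="t1.micro",       # keep to the free-tier eligible size
    KeyName="my-keypair",          # the key pair whose .pem file you downloaded
    MinCount=1,
    MaxCount=1,
)

instance = instances[0]
instance.wait_until_running()      # block until the VM is up
instance.reload()
print("Connect with: ssh -i my-keypair.pem ubuntu@" + instance.public_dns_name)
```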

Up and running. It is very important to keep your total usage of t1.micro instances under 750 hours per month to avoid charges from Amazon. In order to do that, you have to terminate (yes, delete everything!) your instance when you are done. You might not be using CPU power, but you will still be using Storage, which is also part of Amazon’s business…

Now it’s time to experiment. Why not start with a LAMP stack?

New HP VC Cookbook for iSCSI

Even though it’s been here for such a long time, iSCSI SANs still haven’t convinced everyone. Though I do not want to go into that discussion, any help one can get with recommendations on how to set it up is welcome. HP just released its “HP Virtual Connect with iSCSI“ cookbook, so here are some notes about it.

  • Array supportability: Any Array with iSCSI ports.
  • You can use Accelerated iSCSI (the NIC behaves like an iSCSI HBA) if you choose Blade NICs that support it. The max MTU size in that scenario is limited to 8342 bytes and cannot be changed. On both Windows and VMware you have no control over it, as this is automatically negotiated (still referring to the Accelerated iSCSI option). On a Windows box, when TCPMSS displays 1436 the negotiated MTU is 1514, and when it displays 8260 the negotiated MTU is 8342.
  • Note that though you can directly connect Storage Host ports to Virtual Connect (except with HP P4000 or any other IP-Clustered Storage), I would definitely not do so, since it limits your options. VC does not behave as a Switch, in order to prevent loops and simplify Network management, so traffic will not be forwarded out of the Enclosure (say, for instance, if you had another Rack Server Host wanting iSCSI). This is also the reason you cannot set up more than one box of HP P4000 with it, since you do require inter-box communication for metadata and cluster formation.
  • Do not forget the iSCSI basics: keep it L2; enable 802.3x Flow Control everywhere (Storage ports, Switch ports, Virtual Connect uplink ports – it is enabled by default on downlink ports – and Host ports if using Software iSCSI); enable Jumbo Frame support at the OS level if you are using Software iSCSI (instead of Accelerated iSCSI); configure iSCSI multipathing when possible (either with the Storage vendor’s DSM or, if possible, the OS’s own MPIO); dedicate at least an isolated VLAN (or dedicated physical devices, i.e. Switches); and finally isolate at least one FlexNIC on the Host for iSCSI traffic.


This figure illustrates a mapping of the VC config, from the Blade’s FlexNICs, to the interior of the VC (two SUS with corporate VLANs plus a dedicated vNet for the iSCSI VLAN), and to the exterior VC uplinks. For redundancy reasons, and to limit contention between Downlinks and Uplinks, one would recommend using at least two Uplink Ports on each VC module dedicated to the iSCSI vNet, whenever a VC module has a one-to-one relationship with the upstream ToR switch (or several ToR switches when using MLAG, such as HP IRF, Cisco Catalyst VSS, Nexus vPC, etc.). When configuring the iSCSI vNets, make sure you enable the “Smart Link” feature, to ensure faster failovers.

You might also want to take advantage of the new VC firmware 4.01 features, which include two helpful ones. The first is flexible overprovisioning of Tx Bandwidth: ensure a minimum pipeline for the iSCSI Host port, and let it fly (by assigning a higher Max Bandwidth) when the pipes on other FlexNICs are empty.

Finally, the second concerns QoS. You might want to use a dedicated 802.1p class for your iSCSI traffic, which is mapped for Egress traffic.