A look at the Hedvig distributed Hybrid Cloud storage solution

During the recently concluded Storage Field Day event (SFD15), I had the chance to travel to the Software Defined Storage company Hedvig in their HQ in Santa Clara where were given a presentation by their engineering team (including the founder) of their solution offering. Now luckily, I knew Hedvig already due to my day job (which involves evaluating new, disruptive tech start-ups to form solutions reseller partnerships – I had already gone through this process with Hedvig a while back). However I learnt about a number of new updates to their solution and this post aims to cover their current solution offering and my thoughts of it, in the current enterprise storage market.

Hedvig: Company Overview

Similar to a number of new storage or backup start-ups came out of stealth in recent times, Hedvig too was founded by an engineer with a technology background, back in 2012. The founder Avinash Lakshman came from a distributed software engineering background, having worked on large scale distributed storage applications such as Amazon Dynamo and Cassandra. While they came out of stealth in 2015, they did not appear to have an aggressive growth strategy backed by an aggressive (read “loud”) marketing effort behind them and looked rather content at natural, organic growth. At least that was my impression seeing how they operated in the UK market anyway. However, during the SFD15 presentation, we found out that they’ve somewhat revamped their logo and related marketing collateral. So perhaps they may well have started to address this already?

Hedvig: Solution Overview


At the outset, they are similar to most other software defined storage start-ups these days that leverages any commodity server hardware on top of their software tier to build a comparatively low cost, software defined storage (SDS) solution. They also have genuine distributed capability to be able to distribute the SDS nodes not just within the data center, but also across data centers as well as cloud platforms, though it’s important to note most SDS vendors these days have got the same capability or are in the process of adding it to their SDS platforms.

Hedvig has positioned themselves as a SDS solution that is a perfect fit for traditional workload such as VMs, backup & DR as well as modern workloads such as Big data, HPC, object storage and various cloud native workloads too. Their solution provides block & file storage capability like most other vendors in their category, as well as object storage which is again another potentially (good) differentiator, especially compared to some of the other HCI solutions out there that often only provide one type or the other.

The Hedvig storage platform typically consist of Hedvig SW platform + commodity server hardware with local disks. Each server node can be a physical server or a VM on a cloud platform that runs the Hedvig software. The Hedvig software consist of,

  • Hedvig Storage Proxy
    • This is a piece of software deployed on the compute node (app server, container instance, hypervisor…etc.)
    • Presents file (NFS) & block (iSCSI) storage to compute environments and coverts that to Hedvig proprietary communication protocol with storage service.
    • Also performs caching of reads (writes are redirected).
    • Performs dedupe up front and writes deduped blocks to the back end (storage nodes) only if necessary
    • Each hypervisor runs a proxy appliance VM / VSA (x2 as a HA pair) which will serve all local IO on that hypervisor
  • Hedvig API
    • Presents object storage via S3 or Swift and full RESTful API from the storage nodes to the storage proxy.
    • Runs on the storage nodes
  • Hedvig Storage Services
    • Manages the storage cluster activities and interface with server proxies
    • Runs on the storage nodes and similar to the role of a typical storage processor / SAN or NAS controller
    • Each storage server has 2 parts
      • Data process
        • Local persistence
        • Replication
      • Metadata process
        • Communicate with each other
        • Distributed logic
        • Stored in a proprietary DB on each node
    • Each virtual disk provisioned in the front end is mapped 1:1 to a Hedvig virtual disk in the back end

The Hedvig storage nodes can be commodity or mainstream OEM vendor servers as customer’s chose to use. They will consist of SSD + Mechanical drives which is typical for other SDS vendors too and the storage nodes which runs the Storage services SW will typically be connected to each other using 10Gbe (or higher) standard Ethernet networking.

Like most other SDS solutions, they have typical SDS features and benefits such as dedupe, compression, auto-tiering, caching, snapshots & clones, data replication…etc. Another potentially unique offering they have here is the ability to set storage policies per virtual disk or per container granularity (in the back end), which is nice. The below are some of key storage policy configuration items that can be set per VM / vDisk granularity.

  • Replication Factor (RF) – Number of copies of the data to keep. Range form 1-6. Quorum = (RF/2)+1. This is somewhat similar to the VMware vSAN FTT if you are a vSAN person.
  • Replication policy – Agnostic, Rack aware and DC aware – Kind of similar to the concept of Fault Domains in vSAN for example. Set the scope of data replication for availability
  • Dedupe – Global dedupe across the cluster. Happens at 512B or 4K block size and is done in-line. Important to node that dedupe happens at the storage proxy level which is ensures no un-necessary writes take place in the back end. This is another USP compared to other SDS solution which is also nice.
  • Compression
  • Client caching
  • …etc.

Data replication, availability & IO operations

Hedvig stores data as containers across the cluster nodes to provide redundancy and enforce the policy configuration items regarding availability at container level. Each vDisk is broken down to 16GB chunks and based on the RF level assigned to the vDisk, will ensure the number of RF copies are maintained across a number of nodes (This is somewhat similar to VMware vSAN component size which is set at 256GB). Each of these 16GB chunks is what is known as a container. Within each node, Hedvig SW will group 3 disks in to a logical group called a storage pool and each container that belong to that storage pool will typically stripe the data across that storage pool’s disks. Storage pool and disk rebalancing occurs automatically during less busy times. Data replication will also take in to account the latency considerations if the cluster spans across multiple geo boundaries / DCs / Cloud environments.

Hedvig software maintains an IO locality in order to ensure best performance for read and write IOs where it will prioritise servicing IO from local & less busy nodes. One of the key things to note that during a write, the Hedvig software doesn’t wait for all the acknowledgement from all the storage nodes unlike some of its competitor solutions. As soon as the quorum is met (Quorum = RF/2 + 1, so if the RF is 4, with a remote node on the cloud or on a remote DC over a WAN link, as soon as the data is written locally to 3 local nodes), it will send the ACK back to the sender and the rest of the data writing / offloading can happen in the background. This ensures the faster write response times, and is probably a key architectural element in how they enable truly distributed nodes in a cluster, which can often include remote nodes over a low latency link, without a specific performance hit to a write operation. This is another potential USP for them, at least architecturally on paper, however in reality, will only likely to benefit if you have a higher RF factor in a large cluster.

Reads are also optimised through using a combination of caching at the storage proxy level as well as actual block reads in the back end prioritising local nodes (with a lower cost) to remote nodes. This is markedly different to how VMware vSAN works for example where it avoids the client-side cache locality in order to avoid skewed flash utilisation across the cluster as well as frequent cache re-warning during VMotion…etc. Both architectural decisions have their pros and cons in my view and I like Hedvig’s architecture as it optimises performance which is especially important in a truly distributed cluster.

A deep dive on this subject including the anatomy of a read and a write is available here.

Hedvig: Typical Use Cases

Hedvig, similar to most of its competition, aim to address number of use cases.

Software Defined Primary Storage

Hedvig operates in traditional storage mode (dedicated storage server nodes providing storage to a number of external compute nodes such as VMware ESXi or KVM or even a bare metal application server) or in Hyper-Converged mode where both compute and storage is provided on a single node. They also state that these deployment architectures can be mixed in the same cluster which is pretty cool.

  • Traditional SDS – Agent (storage proxy) running on the application server accessing the storage and speaks storage protocols. Agent also host local metadata and provide local caching amongst other things. Used in a non-hypervisor deployment such as bare metal deployments of app servers.
  • HCI mode – Agent (storage proxy) running on the Hypervisor (as a control VM / VSA – Similar to Nutanix). This is probably their most popular deployment mode.

Software Defined Hybrid Cloud Storage

Given the truly distributed nature of Hedvig solution platform, they provide a nice Hybrid cloud use case for the complete solution to extend the storage cluster across geographical boundaries including cloud platforms (IaaS instances). Currently supported cloud platforms by Hedvig include AWS, Azure and GCP. Stretching a cluster over to a cloud platform would involve IaaS VMs from the cloud platform being used as cluster nodes with available block storage from the cloud platform providing virtual disks as local drives for each cloud node. When you define Hedvig virtual disks, you get specify the data replication topology across the hybrid cloud. Important to note though that the client accessing those disks will be advised to be run within the same data center / cloud platform / region for obvious performance reasons.

Hedvig also now supports integrating with Docker for containerised workloads through their Docker volume plugin & integration with Kubernetes volume integration framework, similar to most of the other SDS solutions.

Hyper-Converged Backup

This is a something they’ve recently introduced but unless I’ve misunderstood, this is not so much a complete backup solution including offsite backups, but more of a snapshot capability at the array level (within the Hedvig layer). Again, this is similar to most other array level snapshots from other vendor’s solutions and can be used for immediate restores without having to rely on a hypervisor snapshot which would be inefficient. An external backup solution using a backup partner (such as Veeam for example) to offsite those snapshot backups is highly recommended as with any other SDS solution.

My thoughts

I like the Hedvig solution and some of its neat littles tricks such as the clever use of the storage proxy agent to offload some of the backend storage operations to the front end (i.e. dedupe) and therefore potentially reduce back end IO as well as network performance penalty to a minimum between the compute and storage layers. They are a good hybrid SDS solution that can cater for a mixed workload across the private data center as well as public cloud platforms. It’s NOT a specialised solution for a specific workload and doesn’t claim to provide a sub millisecond latency solution and instead, provide a good all-around storage solution that is architected from ground up to be truly distributed. Despite its ability to be used in a traditional storage as well as HCI mode, most of the real-life applications of its technology however, would likely be in a HCI world, with some kind of a Hyper-visor like vSphere ESXi or KVM.

Looking at the organisation itself and their core solution, it’s obvious that they’ve tried to solve a number of hardware defined storage issues that were prevalent in the industry at the time of their inception (2012), through the clever use of software. That act is commendable. However, the sad truth is that, since then, a lot has happened in the industry and a number of other start-ups and established vendors have also attempted to do the same, some with perhaps an unfair advantage due to having their own hypervisor too, which is a critical factor when it comes to your capabilities. Nutanix and VMware vSAN for example, developed similar SDx design principles and tried to address most of the same technical challenges. I fear that those vendors and their solutions were little aggressive in their plans and managed to get their go to market process right in my view, at a much bigger scale as well. Nutanix pioneered in creating a new SDS use case (HCI) in the industry and capitalised on it before everyone else did and VMware vSAN came out as a credible, and potentially better challenger to dominate this space. While Hedvig is independent from a hypervisor platform and therefore provide same capabilities across multiple platforms, the reality is that not many customers would need that capability as they’d be happy with a single Hypervisor & a storage platform. I also think Hedvig potentially missed a trick in their solution positioning in the market to create a differentiated message and win market share. As a result, their growth is nowhere near comparable to that of VMware vSAN or Nutanix for example.

As much as I like the Hedvig technology, I fear for their future and their future survival. Without some significant innovation and some business leadership involved in setting a differentiated strategy for their business, life would be somewhat be difficult, especially if they are to make a commercial success out of the as a company. Their technology is good and engineering team seems credible, but the competition is high and the market is somewhat saturated with so many general purpose SDS solutions as well as specialist SDS solutions aimed at specific workloads. Most of their competition also have much more resources at their disposal to throw at their solution, including more comprehensive marketing engines too. For these reasons, I fear that Hedvig may struggle to survive in their current path of generalised SDS solution and would potentially be better off in focusing on a specific use case / vertical …etc and focusing all their innovation efforts on that.

The founder and the CEO of the company still appears to be very much an engineer at heart still and having an externally sourced business leader with start-up experience to lead Hedvig in to the future may not be a bad thing for them in the long run either, in my view.

Keen to get your thoughts, especially if you are an existing Hedvig customer – Please comment below.

Slide credit goes to Hedvig and Tech Field Day team.

P.S. You can find all the TFD and SFD presentations about Hedvig via the link here.

Chan

Storage Field Day 15 – Watch Live Here

Following on from my previous post about the vendor line-up and my plans during the event, this post is to share the exact vendor presentation schedule and some additional details.

Watch LIVE!

Below is the live streaming link to the event on the day if you’d like to join us LIVE. While the time difference might make it a little tricky for some, it is well worth taking part in as all the viewers will also have the chance to ask questions from the vendors live, similar to the delegates onset. Just do it, you won’t be disappointed!

Session Schedule

Given below is the session schedule throughout the event, starting from Wednesday the 7th. All times are in Pacific time (-8 hours from UK time)

Wednesday the 7th of March

    • 09:30 – 11:30 (5:30-7:30pm UK time) – WekaIO presents
    • 13:00 – 15:00 (9-11pm UK time) – IBM presents
    • 16:00 – 18:00 (12-2am 8th of March, UK time) Dropbox presents

Thursday the 8th of March

  • 08:00-10:00 (4-6pm UK time) – Hedvig presents from their Santa Clara offices
  • 10:30-12:30 (6:30-8:30pm UK time) NetApp presents from their Santa Clara offices
  • 13:30-15:30 (9:30-11:30pm UK time) – Western Digital/Tegile presents from Levi’s Stadium
  • 16:00-18:00 (12-2am 9th of March, UK time) – Datrium presents from Levi’s Stadium

Friday the 9th of March

  • 08:00-10:00 (4-6pm UK time) – StarWinds presents in the Seattle Room
  • 11:00-13:00 (7-9pm UK time) – Cohesity presents at their San Jose offices
  • 14:00-16:00 (10pm-12am UK time) – Huawei presents at their Santa Clara offices