
NetApp & Next Generation Storage Technologies

There are some exciting technology developments taking place in the storage industry; some are happening behind closed doors, but others have been publicly announced and are already commercially available, so most of you may already have come across them. Some of these are organic developments that build on existing technologies, while others are inspired by megascalers like AWS, Azure, GCP and various other cloud platforms. I was lucky enough to be briefed on some of these at SFD12 last year in Silicon Valley by SNIA – the Storage Networking Industry Association – which I've previously blogged about here.

This time around, I was part of the Storage Field Day (SFD15) delegate panel that got a chance to visit NetApp at their HQ in Sunnyvale, CA to find out more about some of the exciting new product offerings in NetApp's roadmap, either in the works or just starting to come out, that incorporate some of these new storage technologies. This post aims to provide a summary of what I learnt there and my thoughts on it.

Introduction

It is no secret that flash media has changed the dynamics of the storage market over the last decade due to its inherent performance characteristics. While the earliest incarnations of flash media were prohibitively expensive to use in large quantities, the arrival of SSDs commoditised the use of flash across the entire storage industry. For example, most tier 1 workloads in enterprises today are held on an SSD-backed storage system, where SSD drives form the whole, or a key part, of the storage media stack.

When you look at some of the key storage solutions in use today, two key technologies stand out: DRAM and SSD. DRAM is the fastest storage media available and the one most easily accessible by the data processing compute subsystem, while SSDs come next in terms of speed of access and level of performance (IOPS & bandwidth). As such, most enterprise storage solutions in the world, whether aimed at customer data centers or at the megascalers' cloud platforms, utilise one or both of these media types to either accelerate (cache) or simply store tier 1 data sets.

It is important to note that, while SSDs benefit from overall higher performance and lower latency compared to mechanical drives thanks to the internal architecture of the SSDs themselves (flash storage cells with no spinning magnetic media), both SSD drives and classic mechanical (spinning) drives are typically attached to and accessed by the compute subsystem via the same SATA or SAS interface, with the same interface speed and latency. As a result, the internal performance of an SSD was often not realised to its full potential, especially in an aggregated scenario like that of an enterprise storage array, due to these interface controller access speed and latency limitations, as illustrated in the diagram below.

One of the more recent technology developments in the storage and compute industry, namely "Non-Volatile Memory Express" (NVMe), aims to address these SAS & SATA driven performance and latency limitations through the introduction of a new, high-performance host controller interface that has been engineered from the ground up to fully utilise flash storage drives. This new NVMe storage architecture is designed to be future proof and to be compatible with various future drive technologies, NAND based as well as non-NAND based storage media.

NVMe SSD drives connected via these NVMe interfaces will not only outperform traditional SSD drives attached via SAS or SATA but, most importantly, will enable future capabilities such as utilising Remote Direct Memory Access (RDMA) for very high storage performance, extending the storage subsystem over a fabric of interconnected storage and compute nodes. A good introduction to the NVMe technology and its benefits over SAS / SATA interfaces can be viewed here.

Another much talked about development on the same front is Storage Class Memory (SCM), also known as Persistent Memory (PMEM). SCM is an organic successor to the NAND-based SSD drives that we see in mainstream use in flash accelerated as well as all flash storage arrays today.

At a theoretical level, SCM can come in 2 main types as shown in the above diagram (from a really great IBM research paper published in 2013).

  • M-Type SCM (Synchronous) = Incorporates non-volatile memory based storage into the memory access subsystem (DDR) rather than the SCSI block based storage subsystem over PCIe, achieving DRAM-like throughput and latency for persistent storage. Typically takes the form of NVDIMM (attached to the memory bus, just like traditional DRAM), which is the fastest and best performing option next to DRAM itself. It uses memory card slots and appears to the system either as a caching layer or as pooled memory (extended DRAM space), depending on the NVDIMM type (NVDIMMs come in 3 types: NVDIMM-N, NVDIMM-F and NVDIMM-P. A good explanation is available here).
  • S-Type SCM (Asynchronous) = Incorporates non-volatile memory based storage, but attached via PCIe to the storage subsystem. While this is theoretically slower than the above, it's still significantly faster than the NAND based SSD drives in common use today, including those attached via the NVMe host controller interface. Intel and Samsung have both already launched S-type SCM drives, Intel with their 3D XPoint architecture and Samsung with Z-SSD respectively, but the drive models currently available are aimed more at consumer / workstation rather than server workloads. Server based implementations of similar SCM drives will likely arrive around 2019 (along with supporting server software included within operating systems such as hypervisors – vSphere 7 anyone?).

The idea of SCM is to address the latency and performance gap between memory and storage that has existed in every computer system since the advent of x86 computing. Typical access latency for DRAM is around 60ns, while the next best option today, NVMe SSD drives, have a typical latency of around 20-200µs. SCM fits in between these two, at a typical latency between 60ns and 20µs depending on the type of SCM, with significantly higher bandwidth than SSD drives. It is important to note, however, that while most ordinary workloads do not need this kind of super latency sensitive, extremely high bandwidth storage performance, the next generation data technologies involving Artificial Intelligence techniques such as machine learning and real-time analytics, which rely on processing extremely large swathes of data in very short time, would absolutely benefit from, and in most instances necessitate, these next gen storage technologies to be fully effective.
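
To put those latency figures in perspective, here is a quick back-of-the-envelope comparison using the approximate numbers above. The SCM figure is an assumed mid-range value of my own for illustration; real devices vary considerably.

```python
# Approximate access latencies from the figures above (illustrative only).
LATENCY_NS = {
    "DRAM": 60,                        # ~60 ns
    "SCM (assumed mid-range)": 500,    # assumption: between DRAM and NVMe
    "NVMe SSD (best case)": 20_000,    # ~20 us
    "NVMe SSD (worst case)": 200_000,  # ~200 us
}

def relative_to_dram(media: str) -> float:
    """How many times slower than DRAM a given media tier is."""
    return LATENCY_NS[media] / LATENCY_NS["DRAM"]

for media, ns in LATENCY_NS.items():
    print(f"{media:>24}: {ns:>9,} ns  (~{relative_to_dram(media):,.0f}x DRAM)")
```

Even in the best case, an NVMe SSD access costs hundreds of DRAM accesses, which is exactly the gap SCM is designed to narrow.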

NetApp’s NVMe & SCM vision

NetApp was one of the first classic storage vendors to incorporate flash into their storage systems in an efficient manner, accelerating the workloads typically stored on spinning disks. This started with the concept of NVRAM, included in their flagship FAS storage solutions as an acceleration layer. Then came Flash Cache (PAM cards), flash media attached via the PCIe subsystem to act as a caching layer for reads, which was also popular. Since the advent of all flash storage arrays, NetApp went another step by introducing all flash storage into their portfolio, through the likes of the All Flash FAS platform that was engineered and tuned for all flash media, as well as the EF series.

NetApp's innovation and constant improvement process hasn't stopped there. During the SFD15 event, we were treated to the next step of this technology evolution when NetApp discussed how they plan to incorporate the above mentioned NVMe and SCM storage technologies into their storage portfolio, in order to provide next gen storage capabilities for next gen use cases such as AI, big data and real-time analytics. Given below is a holistic roadmap of where NetApp sees NVMe and SCM technologies fitting in, based on the characteristics, benefits and costs of each technology.

The planned use of NVMe clearly sits at 2 different points of the host->storage array communication path.

  • NVMe SSD drives : NVMe SSD drives in a storage array, attached via the NVMe host controller interface so that the storage processor (in the controllers) can fully utilise the latency and throughput potential of the SSD drives themselves. This will provide additional performance to the existing arrays.
  • NVMe-OF : NVMe over Fabrics, which attaches the storage to the consumer nodes (servers) via an ultra-low latency NVMe fabric. NVMe-OF enables the use of RDMA capabilities to reduce the distance between the IO generator and the IO processor, thereby significantly reducing latency. NVMe-OF is therefore widely touted to be the next big thing in the storage industry, and a number of specialist start-ups like Excelero have already come to market with dedicated solutions – you can find out more about it in my blog here. An example of an NVMe-OF storage solution available from NetApp is the new NetApp EF570 all flash array. This product is already shipping and more details can be found here or here. This platform offers some phenomenal performance numbers at ultra-low latency, built around their trusted, mature, feature rich, yet simple EF storage platform, which is also a bonus.

The planned (or experimented) use of SCM is in 2 specific areas of the storage stack, driven primarily by the costs of the media vs the need for acceleration.

  • Storage controller side caching: NetApp mentioned that some of the experiments they are working on, with prototype solutions already built, look at using SCM media on the storage controllers as another tier to accelerate performance, in the same way PAM cards or Flash Cache were used on the older FAS systems. This is a relatively straightforward upgrade and would be especially effective in an all flash FAS solution with SSD drives in the back end, where a traditional Flash Cache card based on NAND cells would be less effective.
  • Server (IO generator) side caching: This use case looks at using SCM media on the host compute systems that generate the IO, to act as a local cache – but, most importantly, used in conjunction with the storage controllers rather than in isolation, performing tiering and snapshots from the host cache to a backend storage system like an All Flash FAS.
NetApp is experimenting on this front primarily using their recent acquisition of Plexistor, whose proprietary software combines DRAM and SCM into a single, byte-addressable address space (via memory semantics, which is much faster than SCSI / NVMe addressable storage) and presents that to the applications as a cache, while also presenting the backend NetApp storage array, such as an All Flash FAS, as a persistent storage tier. The applications achieve significantly lower latency and ultra-high throughput this way, by caching the hot data using the Plexistor file system, which incidentally bypasses the complex Linux IO stack (comparison below). The Plexistor tech is supposed to provide enterprise grade features as part of the same software stack, though the specifics of what those enterprise grade features are were lacking (guessing the typical availability and management capabilities natively available within ONTAP?).

Based on some of the initial performance benchmarks, the effect of this is significant, as can be seen below when compared to a normal Linux IO stack.

My thoughts

As an IT strategist and an architect at heart, with a specific interest in storage, who can see super data (read "extremely large quantities of data") processing becoming a common use case across most industries soon due to the introduction of big data, real-time analytics and the accompanying machine learning tech, I can see value in this strategy from NetApp. Most importantly, the fact that they are looking at using these advanced technologies in harmony with the proven, tried and tested data management platforms they already have, in the likes of the ONTAP software, could be a big bonus. The acquisition of Plexistor was a good move for NetApp, and integrating their tech into a shipping product would be super awesome if and when that happens, though I would dare say the use cases would be somewhat limited initially given the Linux dependency. Others are taking note too: HCI vendor Nutanix's acquisition of PernixData hints at Nutanix having a similar strategy to that of Plexistor and NetApp.

While the organic growth of the current product portfolio through incorporating new tech such as NVMe is fairly straightforward and helps NetApp stay relevant, it remains to be seen how well acquisition driven integration, such as that of Plexistor and its SCM technologies into the NetApp platform, will pan out into a shipping product. NetApp has historically had issues with the efficiency of this integration process, which in the past has been known to be slow. This time around, though, under the new CEO George Kurian, who brought in a more agile software development methodology and therefore a more frequent feature & update release cycle, things may well be different. The evidence seen during SFD15 pretty much suggests the same to me, which is great.

Slide credit to NetApp!

Thanks

Chan

Cohesity: A secondary storage solution for the Hybrid Cloud?

Background

A key part of my typical day job involves staying on top of new technologies and key developments in the world of enterprise IT, with an aim to spot commercially viable, disruptive technologies that are not just cool tech but also have a good business value proposition with a sustainable use case.

To this effect, I've been following Cohesity since their arrival in the mainstream market back in 2015, keeping up to date with their platform developments and various feature upgrades such as v2.0, v3.0…etc. with interest. SFD15 gave me another opportunity to catch up with them and get an up to date view of their latest offerings & future direction. I liked what I heard from them! Their solution now looks interesting, their marketing message is a little sharper than it was a while ago and I like the direction they are heading in.

Cohesity: Overview


Cohesity claims to be a specialist, software defined, secondary storage vendor, specialising in the modernisation of the secondary storage tier within the hybrid cloud. Such secondary storage requirements typically include copies of your primary / tier 1 data sets (such as test & dev VM data and reporting & analytics data) or file shares (CIFS, NFS…etc.). These types of data tend to be quite large and therefore typically cost more to store and process. Storing them on the same storage solution as your tier 1 data can therefore be unnecessarily expensive – something I can relate to, having been involved in sizing and designing various storage solutions for customers as an enterprise storage customer as well as a channel SE in my past lives. Often, enterprise customers need separate, dedicated storage solutions to store such data outside of the primary storage cluster, but they are stuck with the same expensive primary storage vendors for choice. Cohesity offers a single, tailor made secondary data platform that spans both ends of the hybrid cloud to address all of these secondary storage requirements. They also provide the ability to act as a hybrid cloud backup storage target, with some added data management capabilities on top, so that they can not only store backup data but also do interesting things with it, across the full hybrid cloud spectrum.

With what appears to be decent growth last year (600% YoY revenue growth) and some good customers already on board, it appears that customers may be taking notice too.

Cohesity: Solution Architecture


A typical Cohesity software defined storage (SDS) solution on-premises comes as an appliance and can start with 3 nodes to form a cluster that provides linear, scalable growth. An appliance is typically a 2U chassis that accommodates 4 nodes, and any commodity or OEM HW platform is supported. The storage itself consists of PCIe flash (up to 2TB per node) + capacity disk, which is the typical storage architecture of every SDS manufacturer these days. Again, similar to most other SDS vendors, Cohesity uses erasure coding or RF2 data sharding across the Cohesity nodes (within each cluster) to provide data redundancy, as part of the SpanFS file system. Note that, given its main purpose as a secondary storage unit, it doesn't have (or need) an All Flash offering, though they may move into the primary storage use case, at least indirectly, in the future.

The Cohesity storage solution can be deployed to remote and branch office locations as well as to cloud platforms, using virtual Cohesity appliances that work hand in hand with the on-premises cluster. Customers can then enable cross cluster data replication and various other integration / interaction activities, similar to how NetApp Data Fabric works for primary data, for example. Note however that Cohesity does not yet permit the configuration of a single cluster across platforms (where you could deploy nodes from the same cluster on premises as well as in the cloud, with erasure coding performing data replication in the way the Hedvig storage solution permits, for example), but we were given hints that this is in the works for a future release.

Cohesity also has some analytics capabilities built into the platform, which can be handy. The analytics engine uses MapReduce natively to avoid the need to build external analytics focused compute clusters (such as Hadoop clusters) and having to move (duplicate) data sets to be presented for analysis. The Analytics Workbench on the Cohesity platform currently permits external custom code to be injected into the platform. This can be used to search the contents of various files held on the Cohesity platform, including pattern matching that enables customers to search for social security or credit card numbers, which would be quite handy for enforcing regulatory compliance. During the SFD15 presentation, it was explained that the capabilities of this platform are being rapidly enhanced to support additional regulatory compliance policy enforcement, such as that required by GDPR. Additional information on Cohesity Analytics capabilities can be found here. A video explaining how this works can also be found here.
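
As an illustration of the kind of custom pattern matching described above, here is a minimal sketch of scanning text for SSN-like and card-like numbers. The regexes and the function are my own example of the general idea, not Cohesity's actual Analytics Workbench API.

```python
import re

# Hypothetical patterns for data that *looks like* a US social security
# number or a 16-digit payment card number (illustration only; real
# compliance scanning needs far more robust rules, e.g. Luhn checks).
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")

def scan_for_pii(text: str) -> dict:
    """Return counts of SSN-like and card-like patterns found in the text."""
    return {
        "ssn_like": len(SSN_RE.findall(text)),
        "card_like": len(CARD_RE.findall(text)),
    }

sample = "Contact: 123-45-6789, card 4111 1111 1111 1111"
print(scan_for_pii(sample))  # one match of each pattern type
```

In the Cohesity model, code along these lines would run against the data already held on the secondary platform, rather than shipping copies out to a separate analytics cluster.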

Outside of these, given that the whole Cohesity solution is backed by a distributed file system that is software defined, they naturally have all the software defined goodness expected of any SDS solution, such as global deduplication, compression, replication, file indexing, snapshots, multi-protocol access, multi-tenancy and QoS, within their platform.

My thoughts

I like Cohesity's current solution and where they are potentially heading. However, the key to their success, in my view, will ultimately be their price point, which I am yet to see in order to make sense of where they sit amongst the competition.

From a technology and strategy standpoint, Cohesity's key use cases are very valid and the way they aim to address them is pretty damn good. When you think about the secondary storage use case – the cost of serving less performance hungry, tier 2 data (often large and clunky in size) from an expensive tier 1 storage array (where you have to include larger SAN & NAS storage controllers + additional storage) – I cannot help but think that Cohesity's secondary storage play is quite relevant for many customers. Tier 1 storage solutions, classic SAN / NAS solutions as well as HCI solutions such as VMware vSAN or Nutanix, are typically priced to reflect their tier 1 use case. So, a cheaper, more appropriate secondary storage solution such as Cohesity could save many customers a lot of unnecessary SAN / NAS / HCI costs by letting them downsize their primary storage solution requirements. This may even further enable more customers to embrace HCI solutions for their tier 1 workloads too, resulting in even less of a need for expensive, hardware centric SAN / NAS solutions except where they are genuinely necessary.

After all, we are all being taught the importance of rightsizing everything (thanks to the utility computing model introduced by the public clouds), so perhaps it's about time that we all look to break down tier 1 and tier 2 data into appropriately sized tier 1 and tier 2 storage solutions, to benefit from the reduced TCO. It's important to note, though, that this rightsizing will likely only appeal to customers with heavy storage use cases, such as typical enterprises and large corporate customers, rather than the average small to medium customer who requires a typical multipurpose storage solution to host some VMs + some file data. This is evident in the customer stats provided to us during SFD15, where 70% of their customers are enterprise customers.

Both of their 2 key use cases, tier 2 data storage and backup storage, now incorporate cloud capabilities and allow customers to do more than just store tier 2 data and backups. This is good and very timely indeed. They take a very data centric approach to their use cases, and the secret sauce behind most of the capabilities, the proprietary file system called SpanFS, looks and feels very much like NetApp's cDOT architecture with some enhancements in parts. They are also partnering up with various primary storage vendors such as Pure to enable replication of backup snapshots from Pure to Cohesity, while introducing additional features like built in NAS data protection for NetApp, EMC and Pure, direct integration with VMware VCF for data protection, and direct integration with Nutanix for AHV protection, which kind of moves them closer to Rubrik's territory. That is interesting, and ultimately provides customers with choice, which is a good thing.

From a hardware & OEM standpoint, Cohesity has partnered up with both HPE and Cisco, and has also made itself available on the HPE pricebook so that customers can order the Cohesity solution using an HPE SKU, which is convenient. That said, I'd personally urge customers to order directly from Cohesity (via your trusted solutions provider) where possible, rather than through an OEM vendor where the pricing may be fixed or engineered to position OEM HW when it's not always required.

Given their mixed capabilities of tier 2 data storage, backup storage, and ever-increasing data management capabilities across platforms, they are coopeting, if not competing, with a number of others: NetApp, who has a similar data management strategy in their "Data pipeline" vision (and who also removes the need for multiple storage silos in the DC for tier 2 data thanks to features such as Clustered Data ONTAP & FlexClones), Veeam, or even Pure Storage. Given their direct integration with various SW & HCI platforms, removing the need for 3rd party backup vendors, they will likely be competing directly with Rubrik more and more in the future. Cohesity's strategy is primarily focused on tier 2 data management, with a secondary focus on data backups and the management of that data, whereas Rubrik's strategy appears to be the same but with the opposite order of priorities (backup 1st, data management 2nd). Personally, I like both vendors and their solution positioning, as I can see the strategic value in both offerings for customers. Most importantly for Cohesity, there doesn't appear to be any other storage vendor specifically focused on the secondary storage market the way they are, so I can see a great future for them, as long as their price point remains relevant and the great innovation keeps continuing.

You can watch all the videos from #SFD15 recorded at the Cohesity HQ in Santa Clara here.

If you are an existing Cohesity user, I'd be very keen to get your thoughts and feedback using the comments section below.

A separate post will follow, looking at Cohesity's SpanFS file system and their key use cases!

Chan

A look at the Hedvig distributed Hybrid Cloud storage solution

During the recently concluded Storage Field Day event (SFD15), I had the chance to travel to the software defined storage company Hedvig at their HQ in Santa Clara, where we were given a presentation by their engineering team (including the founder) on their solution offering. Now, luckily, I knew Hedvig already thanks to my day job (which involves evaluating new, disruptive tech start-ups to form solutions reseller partnerships – I had already gone through this process with Hedvig a while back). However, I learnt about a number of new updates to their solution, and this post aims to cover their current solution offering and my thoughts on it in the context of the current enterprise storage market.

Hedvig: Company Overview

Similar to a number of new storage and backup start-ups that came out of stealth in recent times, Hedvig too was founded by an engineer with a technology background, back in 2012. The founder, Avinash Lakshman, came from a distributed software engineering background, having worked on large scale distributed storage systems such as Amazon Dynamo and Cassandra. While they came out of stealth in 2015, they did not appear to have an aggressive growth strategy backed by an aggressive (read "loud") marketing effort behind them, and looked rather content with natural, organic growth. At least that was my impression from how they operated in the UK market, anyway. However, during the SFD15 presentation we found out that they've somewhat revamped their logo and related marketing collateral, so perhaps they may well have started to address this already?

Hedvig: Solution Overview


At the outset, they are similar to most other software defined storage start-ups these days that leverage commodity server hardware underneath their software tier to build a comparatively low cost, software defined storage (SDS) solution. They also have genuine distributed capability, being able to distribute the SDS nodes not just within the data center but also across data centers as well as cloud platforms – though it's important to note that most SDS vendors these days have the same capability or are in the process of adding it to their SDS platforms.

Hedvig has positioned themselves as an SDS solution that is a perfect fit for traditional workloads such as VMs, backup & DR, as well as modern workloads such as big data, HPC, object storage and various cloud native workloads. Their solution provides block & file storage capability like most other vendors in their category, as well as object storage, which is another potentially good differentiator, especially compared to some of the other HCI solutions out there that often only provide one type or the other.

The Hedvig storage platform typically consists of the Hedvig SW platform + commodity server hardware with local disks. Each server node can be a physical server or a VM on a cloud platform that runs the Hedvig software. The Hedvig software consists of:

  • Hedvig Storage Proxy
    • This is a piece of software deployed on the compute node (app server, container instance, hypervisor…etc.)
    • Presents file (NFS) & block (iSCSI) storage to compute environments and converts that to Hedvig's proprietary communication protocol with the storage service.
    • Also performs caching of reads (writes are redirected).
    • Performs dedupe up front and writes deduped blocks to the back end (storage nodes) only if necessary
    • Each hypervisor runs a proxy appliance VM / VSA (x2 as an HA pair) which serves all local IO on that hypervisor
  • Hedvig API
    • Presents object storage via S3 or Swift and a full RESTful API from the storage nodes to the storage proxy.
    • Runs on the storage nodes
  • Hedvig Storage Services
    • Manages the storage cluster activities and interfaces with the server proxies
    • Runs on the storage nodes; similar in role to a typical storage processor / SAN or NAS controller
    • Each storage server has 2 parts
      • Data process
        • Local persistence
        • Replication
      • Metadata process
        • Communicate with each other
        • Distributed logic
        • Stored in a proprietary DB on each node
    • Each virtual disk provisioned in the front end is mapped 1:1 to a Hedvig virtual disk in the back end
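
The proxy-side dedupe behaviour described above (a block is shipped to the storage nodes only if its fingerprint hasn't been seen before) can be sketched roughly like this. The class, the hashing scheme and the backend callable are my own simplification of the general idea, not Hedvig's implementation.

```python
import hashlib

class StorageProxySketch:
    """Illustrative sketch of dedupe at the storage proxy: a block is
    sent to the back-end storage nodes only when its fingerprint has
    not been seen before."""

    def __init__(self, backend):
        self.backend = backend   # callable that persists a block in the back end
        self.seen = set()        # fingerprints of blocks already written

    def write_block(self, data: bytes) -> bool:
        """Return True if the block was actually sent to the back end."""
        fp = hashlib.sha256(data).hexdigest()
        if fp in self.seen:
            return False         # duplicate: no unnecessary back-end write
        self.seen.add(fp)
        self.backend(fp, data)
        return True

backend_store = {}
proxy = StorageProxySketch(lambda fp, d: backend_store.__setitem__(fp, d))
print(proxy.write_block(b"A" * 4096))  # True: first copy is written
print(proxy.write_block(b"A" * 4096))  # False: duplicate is suppressed
print(len(backend_store))              # 1
```

Doing this check at the proxy, before the data travels to the storage nodes, is what saves the back-end write traffic the text describes.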

The Hedvig storage nodes can be commodity or mainstream OEM vendor servers, as customers choose. They consist of SSD + mechanical drives, which is typical of other SDS vendors too, and the storage nodes that run the storage services SW are typically connected to each other using 10GbE (or higher) standard Ethernet networking.

Like most other SDS solutions, they have the typical SDS features and benefits such as dedupe, compression, auto-tiering, caching, snapshots & clones, data replication…etc. Another potentially unique offering here is the ability to set storage policies at per virtual disk or per container granularity (in the back end), which is nice. Below are some of the key storage policy configuration items that can be set at per VM / vDisk granularity.

  • Replication Factor (RF) – Number of copies of the data to keep. Ranges from 1-6. Quorum = (RF/2)+1. This is somewhat similar to VMware vSAN's FTT if you are a vSAN person.
  • Replication policy – Agnostic, Rack aware and DC aware – kind of similar to the concept of fault domains in vSAN, for example. Sets the scope of data replication for availability.
  • Dedupe – Global dedupe across the cluster. Happens at 512B or 4K block size and is done in-line. Important to note that dedupe happens at the storage proxy level, which ensures no unnecessary writes take place in the back end. This is another USP compared to other SDS solutions, which is also nice.
  • Compression
  • Client caching
  • …etc.
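
The RF / quorum relationship in the policy list above is simple integer arithmetic; a quick sketch of it (my own illustration, using Python's floor division for RF/2):

```python
def quorum(rf: int) -> int:
    """Quorum = (RF / 2) + 1, per the Replication Factor policy above."""
    if not 1 <= rf <= 6:
        raise ValueError("Replication Factor ranges from 1 to 6")
    return rf // 2 + 1

for rf in range(1, 7):
    print(f"RF={rf}: quorum={quorum(rf)}")
# e.g. RF=4 needs 3 acknowledged copies before a write is confirmed
```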

Data replication, availability & IO operations

Hedvig stores data as containers across the cluster nodes to provide redundancy and to enforce the availability-related policy configuration items at container level. Each vDisk is broken down into 16GB chunks and, based on the RF level assigned to the vDisk, the software will ensure that the required number of copies is maintained across a number of nodes (this is somewhat similar to the VMware vSAN component size, which is set at 256GB). Each of these 16GB chunks is what is known as a container. Within each node, the Hedvig SW groups 3 disks into a logical group called a storage pool, and each container belonging to that storage pool will typically stripe its data across that storage pool's disks. Storage pool and disk rebalancing occurs automatically during less busy times. Data replication will also take into account latency considerations if the cluster spans multiple geo boundaries / DCs / cloud environments.
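
Given the 16GB container size and a per-vDisk RF, the number of containers (and replicated copies) behind a vDisk follows directly. A small back-of-the-envelope sketch; the helper function is my own, not part of Hedvig's software:

```python
import math

CONTAINER_GB = 16  # each vDisk is broken into 16GB chunks ("containers")

def containers_for(vdisk_gb: int, rf: int) -> tuple:
    """Return (container count, total replicated copies) for a vDisk."""
    containers = math.ceil(vdisk_gb / CONTAINER_GB)
    return containers, containers * rf

print(containers_for(100, 3))  # a 100GB vDisk at RF=3 -> (7, 21)
```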

The Hedvig software maintains IO locality in order to ensure the best performance for read and write IOs, prioritising servicing IO from local & less busy nodes. One of the key things to note is that during a write, the Hedvig software doesn't wait for acknowledgements from all the storage nodes, unlike some of its competitor solutions. As soon as the quorum is met (Quorum = RF/2 + 1; so if the RF is 4, with a remote node in the cloud or in a remote DC over a WAN link, as soon as the data is written to 3 local nodes), it will send the ACK back to the sender, and the rest of the data writing / offloading happens in the background. This ensures faster write response times, and is probably a key architectural element in how they enable truly distributed nodes in a cluster, which can often include remote nodes over a higher latency link, without a specific performance hit on write operations. This is another potential USP for them, at least architecturally on paper; in reality, however, it will likely only show a benefit with a higher RF in a large cluster.
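
The early-ACK write path described above can be sketched as follows. This is a deliberately simplified model of the ordering logic; the node names and the function are my own illustration, not Hedvig's code:

```python
# Sketch: a write replicated to RF nodes is acknowledged to the client as
# soon as quorum copies complete; remaining (e.g. slow WAN) replicas
# finish in the background.
def write_with_early_ack(rf: int, completed_order: list) -> str:
    """completed_order lists the nodes in the order their writes complete.
    Returns the node whose completion triggers the ACK to the client."""
    quorum = rf // 2 + 1
    assert len(completed_order) == rf, "one completion per replica expected"
    return completed_order[quorum - 1]  # ACK fires here; rest is background

# RF=4: three fast local nodes complete first, the remote node lags.
order = ["local-1", "local-2", "local-3", "remote-dc"]
print(write_with_early_ack(4, order))  # ACK after "local-3"; remote trails
```

The write latency the client sees is thus bounded by the quorum-th fastest replica, not the slowest one, which is why a lagging remote node doesn't penalise every write.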

Reads are also optimised using a combination of caching at the storage proxy level and, for actual block reads in the back end, prioritising local nodes (with a lower access cost) over remote nodes. This is markedly different to how VMware vSAN works, for example, which deliberately avoids client-side cache locality in order to prevent skewed flash utilisation across the cluster as well as frequent cache re-warming during vMotion, etc. Both architectural decisions have their pros and cons in my view, and I like Hedvig's architecture as it optimises performance, which is especially important in a truly distributed cluster.
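As an illustrative sketch of that cost-based read path (the node names and cost values are hypothetical, not Hedvig's actual implementation), picking the cheapest replica to serve a read might look like:

```python
# Toy cost-based read selection: prefer the replica on the node with the
# lowest relative access cost (local < same-DC < remote cloud).
def pick_read_source(replicas, cost):
    """replicas: nodes holding the block; cost: node -> relative cost."""
    return min(replicas, key=lambda node: cost[node])

cost = {"local-node": 1, "same-dc-node": 5, "remote-cloud-node": 50}
source = pick_read_source(["remote-cloud-node", "local-node"], cost)
print(source)  # local-node
```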

A deep dive on this subject including the anatomy of a read and a write is available here.

Hedvig: Typical Use Cases

Hedvig, similar to most of its competitors, aims to address a number of use cases.

Software Defined Primary Storage

Hedvig operates in a traditional storage mode (dedicated storage server nodes providing storage to a number of external compute nodes such as VMware ESXi, KVM or even bare-metal application servers) or in a Hyper-Converged mode where both compute and storage are provided on a single node. They also state that these deployment architectures can be mixed in the same cluster, which is pretty cool.

  • Traditional SDS – The agent (storage proxy) runs on the application server accessing the storage and speaks the storage protocols. The agent also hosts local metadata and provides local caching, amongst other things. Used in non-hypervisor deployments such as bare-metal app servers.
  • HCI mode – The agent (storage proxy) runs on the hypervisor (as a control VM / VSA, similar to Nutanix). This is probably their most popular deployment mode.

Software Defined Hybrid Cloud Storage

Given the truly distributed nature of the Hedvig platform, it offers a nice hybrid cloud use case where the storage cluster extends across geographical boundaries, including cloud platforms (IaaS instances). Cloud platforms currently supported by Hedvig include AWS, Azure and GCP. Stretching a cluster over to a cloud platform involves IaaS VMs from the cloud platform being used as cluster nodes, with block storage from the cloud platform providing virtual disks as local drives for each cloud node. When you define Hedvig virtual disks, you get to specify the data replication topology across the hybrid cloud. It's important to note, though, that the clients accessing those disks are advised to run within the same data center / cloud platform / region, for obvious performance reasons.

Hedvig also now supports containerised workloads through their Docker volume plugin and integration with the Kubernetes volume framework, similar to most other SDS solutions.

Hyper-Converged Backup

This is something they've recently introduced, but unless I've misunderstood, it is not so much a complete backup solution including offsite backups as a snapshot capability at the array level (within the Hedvig layer). Again, this is similar to most other vendors' array-level snapshots and can be used for immediate restores without having to rely on an inefficient hypervisor snapshot. An external backup solution from a backup partner (such as Veeam, for example) to take those snapshot backups offsite is highly recommended, as with any other SDS solution.

My thoughts

I like the Hedvig solution and some of its neat little tricks, such as the clever use of the storage proxy agent to offload some back-end storage operations to the front end (i.e. dedupe), potentially reducing back-end IO as well as keeping the network performance penalty between the compute and storage layers to a minimum. They are a good hybrid SDS solution that can cater for mixed workloads across the private data center as well as public cloud platforms. It's NOT a specialised solution for a specific workload and doesn't claim to provide sub-millisecond latency; instead, it provides a good all-round storage solution, architected from the ground up to be truly distributed. Despite its ability to be used in traditional storage as well as HCI mode, most real-life applications of the technology would likely be in an HCI setting, with a hypervisor such as vSphere ESXi or KVM.

Looking at the organisation itself and their core solution, it's obvious that they've tried to solve a number of hardware-defined storage issues that were prevalent in the industry at the time of their inception (2012), through the clever use of software. That is commendable. However, the sad truth is that a lot has happened in the industry since then, and a number of other start-ups and established vendors have attempted the same, some with perhaps an unfair advantage in owning their own hypervisor too, which is a critical factor when it comes to capabilities. Nutanix and VMware vSAN, for example, developed similar SDx design principles and addressed most of the same technical challenges. Those vendors and their solutions were a little more aggressive in their plans and, in my view, got their go-to-market process right, at a much bigger scale as well. Nutanix pioneered a new SDS use case (HCI) in the industry and capitalised on it before everyone else did, and VMware vSAN came out as a credible, and potentially better, challenger to dominate this space. While Hedvig is independent of any hypervisor platform and therefore provides the same capabilities across multiple platforms, the reality is that not many customers need that capability, as most are happy with a single hypervisor and storage platform. I also think Hedvig potentially missed a trick in their market positioning by failing to create a differentiated message and win market share. As a result, their growth is nowhere near comparable to that of VMware vSAN or Nutanix.

As much as I like the Hedvig technology, I fear for their future survival. Without some significant innovation and business leadership to set a differentiated strategy, life will be difficult, especially if they are to make a commercial success out of the technology as a company. Their technology is good and the engineering team seems credible, but competition is high and the market is somewhat saturated with general-purpose SDS solutions as well as specialist SDS solutions aimed at specific workloads. Most of their competitors also have far more resources to throw at their solutions, including more comprehensive marketing engines. For these reasons, I fear that Hedvig may struggle to survive on their current path of a generalised SDS solution and would potentially be better off focusing on a specific use case / vertical, etc., and concentrating all their innovation efforts there.

The founder and CEO of the company still appears to be very much an engineer at heart, and bringing in an externally sourced business leader with start-up experience to lead Hedvig into the future may not be a bad thing for them in the long run either, in my view.

Keen to get your thoughts, especially if you are an existing Hedvig customer – Please comment below.

Slide credit goes to Hedvig and Tech Field Day team.

P.S. You can find all the TFD and SFD presentations about Hedvig via the link here.

Chan

Dropbox’s Magic Pocket: Power of Software Defined Storage

Background

Dropbox is one of the poster boys of modern-day tech start-ups, similar to the Ubers and Netflixes of the world: founded by engineers using their engineering prowess to help consumers around the world address various day-to-day challenges with technology in a novel way. So, when I was informed that not only would Dropbox be presenting at SFD15 but that we'd also get to tour their state-of-the-art data center, I was ecstatic (perhaps ecstatic is an understatement!). I work with various technology vendors, from large ones like Microsoft, Amazon, VMware, Cisco and NetApp to little-known start-ups, and Dropbox's name is often mentioned in event keynote speeches, case studies, etc. by most of these vendors as a perfect example of how a born-in-the-cloud organisation can use modern technology efficiently. Heck, they are even referenced in some of the AWS training courses I've come across on Pluralsight, which talk about Dropbox's ingenious way of using AWS S3 storage behind the scenes to store file data content.

So, when I learned that they had designed and built their own Software Defined Storage solution to bring most of their data storage back from AWS to their own data centres, I was quite curious to find out more about the platform and the reasoning behind the move back on-premises. Given it was the first time their engineering team had openly discussed it, I was looking forward to talking to them at the event.

This post summarises what I learnt from the Dropbox team.

Introduction

I don't think it's necessary to introduce Dropbox to anyone these days. If, however, you've been under a rock for the past 4 years, Dropbox is the pioneering tech organisation from Silicon Valley that built an online content sharing and collaboration platform which automatically synchronises content between various end user devices while letting you access it on any device, anywhere. Through this process of data synchronisation and content sharing, they are dealing with,

  • 500+ million users
  • 500+ Petabytes of data storage
  • 750+ billion API calls handled

When they first went live, Dropbox used AWS S3 storage to store the actual user file data behind the scenes, while their own web servers hosted the metadata about those files and users. However, as their data storage requirements grew, the drawbacks of this architecture started to outweigh benefits such as the agility and ease of leveraging AWS cloud storage. As such, Dropbox decided to bring the file storage back into their own on-premises data centers. Dropbox states two key reasons behind this decision: performance requirements and raw storage costs. Given their unique use case for block storage at extremely high scale, by designing a tailor-made cloud storage solution of their own, engineered to provide maximum performance at the lowest unit cost, Dropbox planned to save a significant amount of operational cost. As a private company about to go public, saving costs was obviously high on their agenda.

Magic Pocket: Software Architecture

While the name originally came from an old internal nickname for Dropbox itself, Magic Pocket (MP) now refers to their custom-built, internally hosted, software-defined cloud storage infrastructure that is now used by Dropbox to host the majority of their users' file data. It is multiple exabytes in size, with data fully replicated for availability, and offers high data durability (12 nines) and high availability (4 nines).

Within the MP architecture, files are split into blocks and replicated across geographic boundaries within Dropbox's internal infrastructure (back-end storage nodes) for durability and availability. The data stored in the MP infrastructure consists of 4MB blocks that are immutable by design. Changes to the data in the blocks are tracked through a File Journal that is part of the metadata held on the Dropbox application servers. Due to the temporal locality of the data, the bulk of the static, cold data is stored on high-capacity, high-latency but cheap spinning drives, while metadata, cache data and DBs are kept on high-performance, low-latency but expensive SSDs.
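As a rough illustration of the immutable-block-plus-journal idea (a toy model in my own Python, not Dropbox's code; the class and field names are mine, and I assume blocks are addressed by a content hash):

```python
import hashlib

class BlockStore:
    """Toy model: blocks are immutable once written; a change to a file
    writes a brand-new block, and an append-only journal (standing in
    for the File Journal metadata) records which block is current."""
    def __init__(self):
        self.blocks = {}   # content hash -> bytes, never overwritten
        self.journal = []  # append-only: (file, block_no, hash)

    def put(self, file, block_no, data):
        h = hashlib.sha256(data).hexdigest()
        self.blocks.setdefault(h, data)        # identical data dedupes
        self.journal.append((file, block_no, h))
        return h

    def current(self, file, block_no):
        # The latest journal entry for this (file, block) wins.
        for f, b, h in reversed(self.journal):
            if (f, b) == (file, block_no):
                return self.blocks[h]
        return None

store = BlockStore()
store.put("report.doc", 0, b"draft one")
store.put("report.doc", 0, b"draft two")  # old block remains untouched
```

The key property mirrored here is that an "edit" never mutates stored data; it only appends, which is what makes replication and repair of cold blocks so much simpler.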

Unlike most enterprise-focused Software Defined Storage (SDS) solutions that use some kind of quorum-style consensus or distributed coordination to ensure data availability and integrity, MP uses a simple, centralised, sharded MySQL cluster, which is a bit of a surprise. Data redundancy is provided through... yes, you guessed it: customised erasure coding, similar to many other enterprise SDS solutions. Data is typically replicated in 1GB chunks (known as buckets) that consist of random, often contiguous 4MB blocks. A bucket is replicated or erasure-coded across multiple physical servers (storage nodes), and a set of one or more buckets replicated to a set of nodes makes up a volume. This architecture is somewhat similar to how the enterprise SDS vendor Hedvig stores data in its back end.
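A hypothetical sketch of that block -> bucket -> volume -> node mapping (all names and the modulo bucket-to-volume rule are my own simplifications, not Dropbox's schema):

```python
# Toy index: ~4 MB blocks fill 1 GB buckets; each bucket maps to a
# volume that is replicated (or erasure-coded) across storage nodes.
BUCKET_CAPACITY = 1 * 1024**3  # 1 GB per bucket

class BucketIndex:
    def __init__(self, nodes_per_volume):
        self.nodes_per_volume = nodes_per_volume  # volume -> node list
        self.bucket_fill = [0]                    # bytes used per bucket
        self.block_to_bucket = {}

    def add_block(self, block_id, size):
        if self.bucket_fill[-1] + size > BUCKET_CAPACITY:
            self.bucket_fill.append(0)            # open a new bucket
        self.bucket_fill[-1] += size
        bucket = len(self.bucket_fill) - 1
        self.block_to_bucket[block_id] = bucket
        return bucket

    def nodes_for_block(self, block_id):
        bucket = self.block_to_bucket[block_id]
        volume = bucket % len(self.nodes_per_volume)  # toy placement rule
        return self.nodes_per_volume[volume]

idx = BucketIndex([["n1", "n2", "n3"], ["n4", "n5", "n6"]])
for i in range(300):
    idx.add_block(f"b{i}", 4 * 1024**2)  # 300 blocks of 4 MB each
```

In the real system this index lives in the sharded MySQL tier rather than an in-process dict, but the lookup chain is the same.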

In Dropbox's SDS architecture, a pocket is similar to a fault domain in other enterprise SDS solutions and represents a geographical zone (US East, US West and US Central, for example). Each zone has a cluster of storage servers and other application servers, and data blocks are replicated across multiple zones for availability. Pretty standard stuff so far.

Dropbox has a comprehensive edge network, geographically dispersed across the world, through which all customers' Dropbox application connections are funnelled. The client connectivity path is: application (on the user device) -> local PoP (proxy servers in an edge location) -> block servers -> Magic Pocket infrastructure servers -> storage nodes. While the proxy servers in edge locations don't cache any data and can almost be thought of as typical web servers the clients connect through, the block / MP / storage node servers are ordinary x86 servers housed within Dropbox's own DCs. These servers are multi-sourced as per best practice, and somewhat customised for Dropbox's specific requirements, especially the storage node servers. Storage nodes are customised, high-density servers with a capacity of around 1PB of raw data each, using local disks. All servers run a generic version of Ubuntu on bare metal rather than as VMs.

Inside each zone, application servers such as the block and Magic Pocket app and DB servers act as gateways for storage requests coming through the edge servers. These also host the metadata mapping for block placement (the block index) in the back end and run sharded MySQL clusters (on SSD storage) to store this information. Cross-zone replication is also initiated asynchronously within this tier.

A cell is a logical entity of physical storage servers (a cluster of storage nodes) and forms the core of Dropbox's proprietary storage back end, which is worth a closer look. These nodes have very large local disks, with each storage server (node) holding around 1PB of storage, and are used as dumb nodes for block-level data storage. The replication table, which runs in memory as a small MySQL DB, stores the mapping of logical bucket <-> volume <-> storage nodes. This is also part of the metadata stack and is stored on app / DB servers with SSD storage.

The Master is the software component within each cell that acts as a janitor, performing back-end tasks such as storage node monitoring, creating storage buckets, and other background maintenance operations. The Master is not on the data plane, however, so it doesn't affect immediate data read / write operations. There is a 1:1 mapping between Master and cell. The Volume Manager (another software component) can be thought of as the data mover / heavy lifter, responsible for handling instructions from the Master and performing the corresponding operations on the storage nodes. The Volume Manager runs on the actual storage nodes (storage servers) in the back end.

The front end (the interface to the SDS platform) supports simple operations: Put, Get and Repair. (Details of how this works can be found here.)
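In sketch form, the whole front end boils down to those three verbs (a toy, in-memory stand-in of my own; the real implementation sits on the metadata and storage tiers described above):

```python
class FrontEnd:
    """Toy stand-in for the Put/Get/Repair front-end interface."""
    def __init__(self):
        self.store = {}  # block hash -> data

    def put(self, block_hash, data):
        self.store[block_hash] = data  # blocks are write-once

    def get(self, block_hash):
        return self.store.get(block_hash)

    def repair(self, block_hash, healthy_replica):
        # Re-create a lost or corrupt block from a surviving replica.
        if block_hash not in self.store:
            self.store[block_hash] = healthy_replica
```

The narrowness of this surface is the point: with immutable blocks there is no update or rename path to reason about, which keeps the distributed-systems problem tractable.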

Magic Pocket: Storage Servers

Dropbox's customised, high-density storage servers make up the actual back-end storage infrastructure. Typically, each server has a 40GbE NIC and around 90 high-capacity enterprise SATA drives as local disks, totalling around 1PB of raw space per node; each runs the Magic Pocket SDS application code on bare-metal Ubuntu Linux, and their lifecycle management is heavily automated using proprietary, custom-built tools. This set-up creates a significantly large fault domain per storage node given each node's huge capacity, but the load-balancing capabilities architected into the wider SDS application and network mean it is designed to tolerate the complete failure of a server or even a cell. We were treated to a demonstration of how this works in action when the engineering team decided to randomly pull networking cables while we were touring the DC, and then also cut the power to a full rack, with zero impact on the normal operation of Dropbox's service. That was pretty cool to see.

My thoughts

Companies like Dropbox inspire me to think outside the box about what is possible and how to address modern business requirements in innovative ways using technology. Similar to the session on the Open19 project (part of the Open Compute Project) from the LinkedIn engineering team during the SFD12 event last year, this session hugely inspired me about the power of software and hardware engineering, and the impact initiatives like this can have on the wider IT community that we all live and breathe in.

As for the Magic Pocket SDS and hardware architecture... I am a big fan, and it's great to see organisations such as Dropbox and Netflix (with its CDN architecture), who epitomise the extreme ends of certain use cases, publicly opening up about the back-end IT infrastructure powering their solutions, so that the other 99% of enterprise IT folks can learn from and adapt those blueprints where relevant.

It is also important to remember, though, that for normal organisations with typical enterprise IT requirements, such custom-built solutions will not be practical, nor would they be required; often, their needs can be met with a similarly architected, commercially available Software Defined Storage solution, tailored to their requirements. The most important takeaway here is the power of Software Defined Storage itself. If Dropbox can meet their extreme storage requirements through an SDS solution that operates at a lower cost than a proprietary storage solution, the average corporate or enterprise storage use case has no excuse to keep buying expensive SAN / NAS hardware at a premium price. Most enterprise SDS solutions (VMware vSAN, Nutanix, Hedvig, Scality, etc.) have a very similar software and hardware architecture to Dropbox's, and carry a lower price point than the expensive, hardware-centric storage solutions from big vendors like EMC, NetApp, HPE, IBM, etc. So why not look into an SDS solution if your SAN / NAS is up for renewal? You can very likely save significant costs and, at the same time, benefit from software-defined innovation, which tends to come quicker when there's no proprietary hardware baggage.

Given Dropbox's unique scale and storage size, they made a conscious decision to move the majority of their storage requirements away from AWS (S3 storage), as they had gone past the point where using cloud storage was economical or performant enough. But it is also important to remember that they only got to that point through the growth of their business, which at the beginning was enabled by the very agility of the same AWS S3 cloud storage platform they decided to move away from. Most organisations are nowhere near Dropbox's level of scale, so it's important to remember that for typical requirements, you can benefit significantly from the clever use of cloud technologies, especially PaaS offerings such as AWS S3, AWS Lambda, Microsoft O365 and Azure SQL, which provide ready-to-use technology platforms without you having to build it all from scratch. In most cases, that freedom and speed of access can be a worthy trade-off for a slightly higher cost.

Keen to get your thoughts – get involved via comments button below!

Image credit goes to Dropbox!

Chan

Storage Field Day 15 – Watch Live Here

Following on from my previous post about the vendor line-up and my plans during the event, this post is to share the exact vendor presentation schedule and some additional details.

Watch LIVE!

Below is the live streaming link to the event if you'd like to join us LIVE. While the time difference might make it a little tricky for some, it is well worth taking part, as all viewers will also have the chance to ask the vendors questions live, just like the delegates on-site. Just do it, you won't be disappointed!

Session Schedule

Below is the session schedule for the event, starting from Wednesday the 7th. All times are in Pacific Time (8 hours behind UK time).

Wednesday the 7th of March

  • 09:30 – 11:30 (5:30-7:30pm UK time) – WekaIO presents
  • 13:00 – 15:00 (9-11pm UK time) – IBM presents
  • 16:00 – 18:00 (12-2am 8th of March, UK time) – Dropbox presents

Thursday the 8th of March

  • 08:00-10:00 (4-6pm UK time) – Hedvig presents from their Santa Clara offices
  • 10:30-12:30 (6:30-8:30pm UK time) NetApp presents from their Santa Clara offices
  • 13:30-15:30 (9:30-11:30pm UK time) – Western Digital/Tegile presents from Levi’s Stadium
  • 16:00-18:00 (12-2am 9th of March, UK time) – Datrium presents from Levi’s Stadium

Friday the 9th of March

  • 08:00-10:00 (4-6pm UK time) – StarWinds presents in the Seattle Room
  • 11:00-13:00 (7-9pm UK time) – Cohesity presents at their San Jose offices
  • 14:00-16:00 (10pm-12am UK time) – Huawei presents at their Santa Clara offices

Storage Field Day 15 – Introduction

Having attended the Storage Field Day 12 edition back in 2017, I was really looking forward to attending this popular event again. Luckily I’ve been invited again to attend the Storage Field Day 15 with an exciting line up of various enterprise storage vendors. This post is a quick intro about the event and the schedule ahead.

SFD – Quick Introduction!

Storage Field Day is part of the popular, invitation-only Tech Field Day series of events organised and hosted by Gestalt IT (GestaltIT.com). The genius idea behind the event is to bring together innovative technology solutions from various vendors (the "sponsors"), who present their solutions live to a room full of independent technology bloggers and thought leaders (the "delegates"), chosen from around the world based on their knowledge, community profile and thought leadership, in order to get their independent opinions (good or bad) on the solutions. The event is also streamed live worldwide for anyone to tune in to, and is often used by technology start-ups to announce their arrival in the mainstream market.

There are various Field Day events organised by Gestalt IT, such as Tech / Storage / Cloud / Mobility / Networking / Virtualisation / Unified Communications / Wireless Field Day, which take place throughout the year, each showcasing the respective technology vendors' solutions. The series is run by chief organiser Stephen Foskett (@SFoskett) and has always been extremely popular amongst vendors, as it provides an ideal opportunity for them to present their new products and solutions to thought leaders and community influencers from around the world and get their valuable thoughts and feedback. It is equally popular amongst the attending delegates, who get the opportunity not only to witness brand-new technology, but also to critique it and express their feedback directly to the vendors.

During each day, the delegates and the organisers (Stephen along with a few supporting crew members, such as the camera crew) travel to each of the participating vendors' offices (often in Silicon Valley), where the vendors' technology and business leaders present their solutions to the audience. Each session is streamed live thanks to the SFD camera crew, and is also recorded and posted to various video-sharing sites for subsequent viewing. If you are a technology person with a keen interest in enterprise technology solutions, I would seriously recommend having a look at their YouTube channel https://www.youtube.com/channel/UCSnTTyp4q7jMhwECxXzMAXQ/featured for past session recordings.

SFD15 – Schedule & Vendor line-up

SFD15 is due to take place in Silicon Valley between the 7th and 9th of March 2018. From what I understand, SFD15 has been fiercely competitive in terms of sponsor slots due to its high popularity, and I've noticed the participating vendor list change a few times, presumably due to that competition. The list of vendors confirmed (as of today, 17.02.2018) is as follows:

  • Cohesity: Cohesity is a relatively new secondary storage vendor that I've been keeping a close eye on for a while now. I know their offering and value proposition fairly well and am looking forward to understanding what's new and their future plans.
  • Dropbox: This is big: for the first time ever, Dropbox are publicly discussing their home-grown, highly customised storage infrastructure, which runs hand in hand with their own Software Defined Storage solution to store all of our data behind the scenes. This is going to be epic, so DO NOT miss this one.
  • Hedvig: Hedvig is a Software Defined Storage company started about 5 years ago in Silicon Valley by an ex-engineer with distributed file system and database design experience at AWS and Facebook, who invented Amazon Dynamo & Cassandra. I've reviewed Hedvig's technology quite closely in the past and have liked what I've seen, so I'm very keen to hear about their latest offering and future plans.
  • Datrium: While I had come across these guys before, I didn't know the details of their offering, other than that they claim to be a (better) alternative to typical converged and Hyper-Converged solutions. So I'm looking forward to finding out a bit more about their offering.
  • NetApp: NetApp is a regular presenter at most Tech Field Day events, and this time around they are likely to present their overall storage strategy (a good one that I'm already familiar with), possibly with a major focus on their HCI offering (which I'm not fully sold on as a credible HCI offering compared to the competition, but think of more as a CI solution). So I'm keen to find out more to clarify this. NetApp is a company I work with very closely and have known for years, and it's always a pleasure to visit their HQ in Sunnyvale.
  • HGST: This will likely be the Tegile offering that was acquired by Western Digital. Looking forward to this.
  • Huawei: Huawei is the popular Chinese data center IT behemoth that grows in popularity each year. Incidentally, Huawei is a vendor partner of my company, and I know they have a growing presence in the UK public sector due to their lower infrastructure costs. While they manufacture everything from servers to Software Defined Networking controllers, I'd presume the focus of their SFD15 session will be their storage offering. I'm personally not fully familiar with Huawei's storage offerings, so this could be a great opportunity to right that wrong.
  • IBM: No introduction necessary for this one. Not sure which IBM storage solution they'll be covering during SFD15.
  • StarWind: StarWind, an HCI solution vendor, presented at SFD12, which I attended, and it's good to see them come back for SFD15. Keen to get an updated view on things.
  • WEKA.IO: These guys claim to have the world's fastest parallel file system and are totally new to me, so I'm really looking forward to finding out the details of their offering and its value proposition.

My Plan

I will be travelling from London Heathrow to San Jose on the 6th, with a view to catching up with this year's delegates over the evening meal that night, which is always fun. There are some familiar faces from last year's SFD12 as well as a few faces new to me, so I'm looking forward to meeting these girls and boys. I am due to fly back to London on Monday the 12th, and I expect to gather some intimate knowledge of these solutions. I aim to publish posts on the most interesting solutions I come across, providing a deep dive and my independent thoughts, during and after the event.

If you are interested in attending a future TFD / SFD event, all the information you need to know can be found here at http://techfieldday.com/delegates/become-field-day-delegate/.

If you are interested in watching these sessions / presentations live, you can find the schedule and the live stream link information at http://techfieldday.com/event/sfd15/. I would seriously encourage you to watch the event live and do take part in the live questions as well.

SFD12 Posts

As mentioned above, I learnt a lot during my SFD12 participation last year about the storage industry in general, as well as about the direction of a number of storage vendors. If you are interested in finding out more, see my #SFD12 articles below.

Apple WWDC 2017 – Artificial Intelligence, Virtual Reality & Mixed Reality

Introduction

As a technologist, I like to stay close to key new developments and trends in the world of digital technology, to understand how they can help users address common day-to-day problems more efficiently. Digital disruption and the technologies behind it, such as Artificial Intelligence (AI), IoT, Virtual Reality (VR), Augmented Reality (AR) and Mixed Reality (MR), are hot topics, as they have the potential to significantly reshape how consumers will consume products and services going forward. I am a keen follower of these disruptive technologies because, in my view, the potential impact they can have on traditional businesses in an increasingly digital, connected world is huge.

Something I heard today coming out of Apple, the largest tech vendor on the planet, about how they intend to use various AI, VR and AR technologies in their next product upgrades across the iPhone, iPad, Apple Watch, App Store, Mac, etc. made me want to summarise those announcements and add my thoughts on how Apple will potentially lead the way to mass adoption of such digital technologies by many organisations of tomorrow.

Apple’s WWDC 2017 announcements

I've been an Apple fan since the first iPhone launch, as they have been the prime example of a tech vendor that utilises cutting-edge IT technologies to provide elegant solutions that address day-to-day requirements in a simple, effective manner with a rich user experience. I practically live on my iPhone every day for work and non-work activities, and also appreciate their other ecosystem products such as the MacBook, Apple Watch, Apple TV and iPad. This is typically not because they are technologically so advanced, but simply because they provide a simple, seamless user experience that increases my productivity in day-to-day activities.

So naturally I was keen to find out about the latest announcements from Apple's Worldwide Developer Conference, held earlier today in San Jose. Having listened to the event and the announcements, I was excited by the new product and software upgrades, but more than that, I was super excited about a couple of related technology integrations Apple are coming out with, which combine AI, VR and AR to provide an even better user experience by building these technology advancements into their product offerings.

Now before I go any further, I want to highlight that this is NOT a summary of their new product announcements. What interested me was not so much the new Apple products, but mainly how Apple, a pioneer in using cutting-edge technologies to create positive user experiences like no other technology vendor on the planet, are going to use these potentially revolutionary digital technologies to provide a hugely positive user experience. This is relevant to every single business out there that manufactures a product or provides a service or solution offering to its customers, as anyone can potentially incorporate the same capabilities in a similar, or even more creative and innovative, manner than Apple to provide a similarly positive user experience.

Use of Artificial Intelligence

Today Apple announced the increased use of various AI technologies everywhere within future Apple products, as summarised below.

  • Increased use of Artificial Intelligence technologies by the personal assistant “Siri”, to provide a more positive & more personalised user experience
    • In the upcoming Watch OS 4 for the Apple Watch, AI technologies such as machine learning will be used to power the new Siri watch face, so that Siri can now provide you with dynamic updates that are specifically relevant to you and what you do (context awareness)
    • The new iOS 11 will include a new voice for Siri, which uses deep learning technologies (AI) behind the scenes to offer a more natural and expressive voice that sounds less machine and more human
    • Siri will also use machine learning on each device (“on-device learning”) to understand what’s most relevant to you based on what you do on your device, so that Siri can make more personalised interactions – in other words, Siri is becoming more context aware thanks to machine learning, providing a truly personal assistant service unique to each user, including predictive tips based on what you are likely to want to do or use next
    • Siri will use machine learning to automatically memorise new words from the content you read (i.e. News), so these words are automatically included in the dictionary & predictive text if you want to type them
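The vocabulary-learning idea is easy to picture. Here is a toy sketch (purely illustrative, and definitely not Apple's implementation): words encountered in content you read are counted into a predictive-text dictionary, so they start showing up as suggestions.

```python
import re
from collections import Counter

class PredictiveDictionary:
    """Toy predictive-text dictionary that learns new words from read content."""

    def __init__(self, base_words):
        self.counts = Counter({w: 1 for w in base_words})

    def learn_from_text(self, text):
        # Every word seen in an article is added (or its count bumped).
        for word in re.findall(r"[a-z']+", text.lower()):
            self.counts[word] += 1

    def suggest(self, prefix, n=3):
        # Rank candidate completions by how often they have been seen.
        matches = [w for w in self.counts if w.startswith(prefix)]
        return sorted(matches, key=lambda w: (-self.counts[w], w))[:n]

d = PredictiveDictionary(["the", "think", "this"])
d.learn_from_text("Thematic coverage of the summit dominated the news today")
print(d.suggest("th"))  # → ['the', 'thematic', 'think']
```

The real system would of course weigh recency, context and much more, but the basic loop of "read content feeds the dictionary" is the same.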

 

  • Use of machine learning in iOS 11 within the Photos app to enable various new capabilities to make life easier with your photos
    • The next version of Apple’s Mac OS, code-named High Sierra, will support additional features in the Photos app, including advanced face-recognition capabilities which utilise AI technologies such as advanced convolutional neural networks to let you group / filter your photos based on who’s in them
    • Machine learning capabilities will also be used to automatically understand the context of each photo, based on its content, to identify photos from events such as sporting events, weddings…etc. and automatically group them / create events / memories
    • Using computer vision capabilities to create seamless loops on Live Photos
    • Use of machine learning to activate palm rejection on the iPad when writing with the Apple Pencil
    • Most machine learning capabilities are now available to 3rd party developers via iOS APIs such as the Vision API (which enables iOS app developers to harness machine learning for face tracking, face detection, landmarks, text detection, rectangle detection, barcode detection, object tracking and image registration) and the Natural Language API (which provides language identification, tokenisation, lemmatisation, part-of-speech tagging and named entity recognition)
    • Introduction of a machine learning model converter, so that 3rd party ML models can be converted to native iOS 11 Core ML functions

 

  • Use of machine learning to improve graphics in iOS 11
    • Another macOS High Sierra update includes Metal 2 (the Apple API that gives app developers near-direct access to GPU capabilities), which will now integrate machine learning into graphics processing to provide advanced capabilities such as Metal performance shaders, recurrent neural network kernels, binary convolution, dilated convolution, L2-norm pooling, dilated pooling etc. (https://developer.apple.com/metal/)
    • The newly announced iMac Pro, with graphics powered by AMD Radeon Vega, can provide up to 22 teraflops of half-precision compute power, which is specifically relevant for machine-learning related content development

 

Use of Virtual Reality & Augmented Reality

  • Announcement of the Metal API for Virtual Reality, to be used by developers – this includes VR integration in the macOS High Sierra Metal 2 API to enable features such as a VR-optimised display pipeline for video editing in VR, along with related updates such as viewport arrays, system trace stereo timelines, GPU queue priorities and frame debugger stereoscopic visualisation
  • Availability of ARKit for iOS 11, to create augmented reality experiences straight from the iPhone, using its camera and built-in machine learning to identify content in live video, in real time

 

Use of IoT capabilities

  • Apple Watch integration for bi-directional information synchronisation between the Apple Watch and ordinary gym equipment: your Apple Watch will now act as an IoT gateway to typical gym equipment such as a treadmill or cross trainer, providing more accurate measurements, while the gym equipment adjusts your workout based on those readings
  • Apple Watch OS 4 will also provide Core Bluetooth connectivity to other devices, such as various healthcare tools, opening up connectivity to those devices through the Apple Watch

 

My thoughts

The potential for digital technologies such as AI, AR & VR to create a better product / service / solution offering in a typical corporate or enterprise environment, in the way Apple has used them, is immense and is often limited only by one’s creativity & imagination. Many organisations around the world, from tech and product vendors to Independent Software Vendors to an ordinary organisation like a high street shop or a supermarket, can benefit from the creative application of new digital technologies such as Artificial Intelligence, Augmented Reality and the Internet of Things in their product / service / solution offerings, providing their customers with a richer user experience as well as exciting new solutions. AI and AR are hot topics in the industry: some organisations are already evaluating their use, while others already benefit from these technologies being made easily accessible to the enterprise through public cloud platforms (Microsoft Cortana Analytics and the Azure Machine Learning capabilities available on Microsoft Azure, for example). But there are also a large number of organisations who are not yet seriously investigating how these technologies could make their business more innovative, differentiated or, at the very least, more efficient.

If you belong to the latter group, I would highly encourage you to start thinking about how these technologies could be creatively adopted by your business. This applies to any organisation of any size in the increasingly digitally connected world of today. If you have a trusted IT partner, I’d encourage you to talk to them too, as the chances are they will have collective experience in helping similar businesses adopt such technologies, which is more beneficial than trying to get there on your own, especially if you are new to it all.

Digital disruption is here to stay, and Apple have just shown how the advanced technologies that come out of it can be used to create better products / solutions for everyday customers. Pretty soon, AI / AR / IoT backed capabilities will become the norm rather than the exception, and how will your business compete if you are not adequately prepared to embrace them?

I’m keen to get your thoughts!

 

You can watch the recorded version of the Apple WWDC 2017 event here.

Image credit goes to #Apple

 

#Apple #WWDC #2017 #AI #DeepLearning #MachineLearning #ComputerVision #IoT #AR #AugmentedReality #VR #VirtualReality #DigitalDisruption #Azure #Microsoft

 

Impact from Public Cloud on the storage industry – An insight from SNIA at #SFD12

As a part of the recently concluded Storage Field Day 12 (#SFD12), we traveled to one of the Intel campuses in San Jose to listen to the Intel storage software team on the future of storage from an Intel perspective (you can read all about it here). While that was great, just before that session we were treated to another, similarly interesting session by SNIA – The Storage Networking Industry Association – and I wanted to brief everyone on what I learnt during that session, which I thought was very relevant to everyone with a vested interest in the field of IT today.

The presenters were Michael Oros, Executive Director at SNIA along with Mark Carlson who co-chairs the SNIA technical council.

Introduction to SNIA

SNIA is a non-profit organisation that was formed 20 years ago to deal with the interoperability challenges of network storage products from various tech vendors. Today there are over 160 active member organisations (tech vendors) who work together behind closed doors to set standards and improve the interoperability of their often competing tech solutions out in the real world. The alphabetical list of all SNIA members is available here, and the list includes key network and storage vendors such as Cisco, Broadcom, Brocade, Dell, Hitachi, HPe, IBM, Intel, Microsoft, NetApp, Samsung & VMware. Effectively, anyone using most enterprise datacenter technologies has likely benefited from SNIA-defined industry standards and interoperability.

Some of the existing storage related initiatives SNIA are working on include the following.

 

 

Hyperscaler (Public Cloud) Storage Platforms

According to SNIA, public cloud platforms, AKA hyperscalers, such as AWS, Azure, Facebook, Google, Alibaba…etc. are now starting to make an impact on how disk drives are designed and manufactured, given their large consumption of storage drives and therefore their vast buying power. In order to understand the impact of this on the rest of the storage industry, let’s first clarify a few basic points about these hyperscaler cloud platforms (for those who didn’t know):

  • Public Cloud providers DO NOT buy enterprise hardware components like the average enterprise customer
    • They DO NOT buy enterprise storage systems (sales people please read “no EMC, no NetApp, no HPe 3PAR…etc.”)
    • They DO NOT buy enterprise networking gear (sales people please read “no Cisco switches, no Brocade switches, no HPe switches…etc.”)
    • They DO NOT buy enterprise servers from server manufacturers (sales people please read “no HPe/Dell/Cisco UCS servers…etc.”)
  • They build most things in-house
    • Often this includes servers, network switches…etc.
  • They DO bulk buy disk drives direct from the disk manufacturers & use home-grown Software Defined Storage techniques to provision that storage

Now if you think about it, large enterprise storage vendors like Dell and NetApp, who normally bulk buy disk drives from manufacturers such as Samsung, Hitachi, Seagate…etc., have always had a certain level of influence over how those drives are made, given their economies of scale (bulk purchasing power). Now, however, public cloud providers, who also bulk buy, often in quantities far bigger than those of the storage vendors, have become hugely influential over how these drives are made, to the point where their influence exceeds that of the legacy storage vendors. This influence is growing such that public cloud providers now have direct input into the initial design of these components (i.e. disk drives…etc.) and how they are manufactured, simply due to their enormous bulk purchasing power, as well as their ability, given their global data center footprint, to test drive performance at a scale that was not previously possible even for the drive manufacturers.

 

The focus these providers have on Software Defined Storage technologies, used to aggregate all the disparate disk drives found in their data center servers, is inevitably leading to various architectural changes in how disk drives need to be made going forward. For example, most legacy enterprise storage arrays rely on old RAID technology to rebuild data during drive failures, and there are various background tasks implemented in the disk drive firmware, such as ECC & IO retry operations during failures, which add to the overall latency of the drive. However, with modern SDS technologies (in use within public cloud platforms as well as in some newer enterprise SDS vendors’ tech), multiple copies of the data are held on multiple drives automatically as part of the software-defined architecture (i.e. erasure coding), which means those drive-level background tasks such as ECC and retry mechanisms are no longer required.
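To make the redundancy point concrete, here is a minimal sketch of the idea using a single XOR parity fragment, as in RAID-5 (real SDS platforms use more general erasure codes such as Reed-Solomon, so treat this as a simplified illustration): the software layer can rebuild a lost fragment itself, so it no longer depends on the drive firmware's own recovery machinery.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length fragments."""
    return bytes(x ^ y for x, y in zip(a, b))

def add_parity(fragments):
    """Append one XOR parity fragment covering all data fragments."""
    parity = fragments[0]
    for f in fragments[1:]:
        parity = xor_bytes(parity, f)
    return fragments + [parity]

def rebuild(stripe, lost_index):
    """Reconstruct a single lost fragment by XOR-ing every surviving one."""
    survivors = [f for i, f in enumerate(stripe) if i != lost_index]
    rebuilt = survivors[0]
    for f in survivors[1:]:
        rebuilt = xor_bytes(rebuilt, f)
    return rebuilt

# Three data fragments plus parity: any one fragment can be lost and rebuilt.
stripe = add_parity([b"data-one", b"data-two", b"data-thr"])
assert rebuild(stripe, 1) == b"data-two"
```

Because this recovery happens in the host software, drive-level retry loops and similar firmware heroics only add latency without adding protection.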

For example, SNIA highlighted Eric Brewer, VP of Infrastructure at Google, who has talked about the key metrics for a modern disk drive being:

  • IOPS
  • Capacity
  • Lower tail latency (the long tail of latencies on a drive, arguably caused by various background tasks, typically produces 2-10x slower response times from a disk in a RAID group, which causes disk & SSD based RAID stripes to experience at least one slow drive 1.5%-2.2% of the time)
  • Security
  • Lower TCO
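The tail-latency point deserves a quick back-of-the-envelope illustration (the probabilities below are assumed numbers for illustration, not Google's measurements): even a rare per-drive slow state compounds quickly across a stripe, because the whole stripe waits for its slowest member.

```python
def p_stripe_sees_slow_drive(p_slow: float, stripe_width: int) -> float:
    """Probability that at least one drive in the stripe is slow right now."""
    return 1 - (1 - p_slow) ** stripe_width

# If each drive is in a slow state just 0.2% of the time, a 10-wide
# stripe hits at least one slow member roughly 2% of the time.
print(round(p_stripe_sees_slow_drive(0.002, 10) * 100, 2))  # → 1.98
```

This is why eliminating unpredictable firmware background work matters far more at stripe level than the per-drive numbers alone would suggest.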

So in a nutshell, public cloud platform providers are now mandating various storage standards that disk drive manufacturers have to factor into their drive designs, such that the drives are engineered from the ground up to work with the software-defined architecture in use on these cloud platforms. What this means is that most native disk firmware operations are made redundant; instead, the drive manufacturer provides APIs through which the cloud platform provider’s own software logic controls those background operations itself, based on its software-defined storage architecture.

Some of the key results of this approach include the following architectural designs for public cloud storage drives:

  • Higher layer software handles data availability and is resilient to component failure so the drive firmware itself doesn’t have to.
    • Reduces latency
  • Custom Data Center monitoring (telemetry), and management (configuration) software monitors the hardware and software health of the storage infrastructure so the drive firmware doesn’t have to
    • The Data Center monitoring software may detect these slow drives and mark them as failed (ref Microsoft Azure) to eliminate the latency issue.
    • The Software Defined Storage then automatically finds new places to replicate the data and protection information that was on that drive
  • Primary model has been Direct Attached Storage (DAS) with CPU (memory, I/O) sized to the servicing needs of however many drives of what type can fit in a rack’s tray (or two) – See the OCP Honey Badger
  • With the advent of higher speed interfaces (PCI NVMe) SSDs are moving off of the motherboard onto an extended PCIe bus shared with multiple hosts and JBOF enclosure trays – See the OCP Lightning proposal
  • Remove the drive’s ability to schedule advanced background operations such as garbage collection, scrubbing, remapping, cache flushes, continuous self-tests…etc. on its own, and allow the host to control the scheduling of these latency-increasing drive maintenance operations when it sees fit – effectively removing the drive control plane and moving it up to the control of the public cloud platform (SAS = SBC-4 Background Operation Control, SATA = ACS-4 Advanced Background Operations feature set, NVMe = available through NVMe Sets)
    • Reduces unpredictable latency fluctuations & tail latency

The result of all this is that public cloud platform providers such as Microsoft, Google and Amazon are now also involved in setting industry standards through organisations such as SNIA, a task previously done only by hardware manufacturers. An example is the DePop standard, now approved at T13, which essentially defines a standard whereby the storage host, rather than the disk firmware, shrinks the usable size of the drive by removing poorly performing (slow) physical elements, such as drive sectors, from the LBA address space. The most interesting part is that the drive manufacturers are then required to supply replacement drives once enough usable space has been shrunk to match the capacity of a full drive, without necessarily getting the old drives back (i.e. public cloud providers only pay for usable capacity, and any unusable capacity is replenished with new drives), which is a totally different operational and commercial model to that of the legacy storage vendors who consume drives from the drive manufacturers.
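A rough sketch of that depopulation model (assumed data structures for illustration only; the real DePop standard is a T13 command set, not Python): slow physical elements are removed from the usable LBA space, and once enough capacity has been shed across the fleet, a replacement drive is due from the manufacturer.

```python
class DepopulatedDrive:
    """Toy model of DePop-style capacity shrinking on one drive."""
    def __init__(self, capacity_gb: float):
        self.capacity_gb = capacity_gb
        self.removed_gb = 0.0

    def depopulate(self, slow_element_gb: float):
        # The host removes a slow physical element from the LBA address space.
        self.removed_gb += slow_element_gb

    @property
    def usable_gb(self) -> float:
        return self.capacity_gb - self.removed_gb

def drives_owed(drives) -> int:
    """Replacement drives due once total shed capacity equals whole drives.
    Assumes a fleet of identically sized drives for simplicity."""
    lost = sum(d.removed_gb for d in drives)
    return int(lost // drives[0].capacity_gb)

fleet = [DepopulatedDrive(1000.0) for _ in range(4)]
for d in fleet:
    d.depopulate(300.0)    # each drive sheds 300 GB of slow sectors
print(drives_owed(fleet))  # → 1 (1200 GB shed exceeds one 1 TB drive)
```

The commercial twist in the text is captured by `drives_owed`: the cloud provider pays only for usable capacity, so shed capacity accumulates into whole replacement drives.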

Another concept pioneered by the public cloud providers is called Streams, which maps lower-level drive blocks to the upper-level object (such as a file) that resides on them, in such a way that all the blocks making up the file object are stored contiguously. This simplifies the effect of a TRIM or SCSI UNMAP command (executed when the file is deleted from the file system), which reduces the delete penalty and causes the least amount of damage to SSD drives, extending their durability.
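Conceptually, Streams can be pictured like this (a heavily simplified toy, not the real Streams command set): all blocks belonging to one object go into the same stream, so deleting the object invalidates one contiguous region instead of scattering stale blocks across many erase units.

```python
class StreamedSSD:
    """Toy SSD that groups each object's blocks contiguously per stream."""
    def __init__(self):
        self.streams = {}   # stream_id -> list of blocks written together

    def write(self, stream_id: str, blocks):
        # All of an object's blocks land in the same contiguous stream.
        self.streams.setdefault(stream_id, []).extend(blocks)

    def delete(self, stream_id: str) -> int:
        # A delete (TRIM/UNMAP) frees one contiguous run of blocks,
        # minimising garbage-collection write amplification.
        return len(self.streams.pop(stream_id, []))

ssd = StreamedSSD()
ssd.write("file-A", ["blk0", "blk1", "blk2"])
ssd.write("file-B", ["blk3"])
print(ssd.delete("file-A"))  # → 3 blocks freed in one contiguous region
```

Without stream grouping, those three blocks could sit in three different erase units, each of which the SSD would later have to garbage-collect around.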

 

Future proposals from Public Cloud platforms

SNIA also mentioned some future focus areas from these public cloud providers, such as:

  • Hyperscalers (Google, Microsoft Azure, Facebook, Amazon) are trying to get SSD vendors to expose more information about internal organization of the drives
    • Goal to have 200 µs Read response and 99.9999% guarantee for NAND devices
  • I/O Determinism means the host can control the order of writes and reads to the device to get predictable response times:
    • Window of reading – deterministic responses
    • Window of writing and background – non-deterministic responses
  • The birth of ODM – Original Design Manufacturers
    • There is a new category of storage vendor called Original Design Manufacturer (ODM) direct, who package up best-in-class commodity storage devices into racks according to customer specifications and who operate at much lower margins.
    • They may leverage hardware/software designs from the Open Compute Project (OCP) or a similar effort in China called Scorpio, now under an organisation called the Open Data Center Committee (ODCC), as well as other available hardware/software designs.
    • SNIA also mentioned a few examples of large global enterprise organisations, such as a large bank, taking the approach of using ODMs to build a custom storage platform, achieving over 50% cost savings compared with traditional enterprise storage

 

My Thoughts

All of these changes introduced by the public cloud platforms are collectively set to change how the rest of the storage industry fundamentally operates, which I believe will be good for end customers. Public cloud providers are often software vendors who approach everything with a software-centric solution, typically a highly cost-efficient architecture of the cheapest commodity hardware underpinned by intelligent software. This will likely reshape the legacy storage industry too, and we are already starting to see the early signs of it today in the sudden growth of enterprise-focused Software Defined Storage vendors and in legacy storage vendors struggling with their storage revenue. All public cloud computing and storage platforms are continuously evolving for cost efficiency, and each of their innovations in how storage is designed, built & consumed will trickle down to enterprise data centers in some shape or form to increase overall efficiency, which surely is only a good thing, at least in my view. Smart enterprise storage vendors that are software focused will take note of such trends and adapt accordingly (i.e. SNIA mentioned that NetApp, for example, has implemented the Streams commands on the front end of their arrays to increase the life of the SSD media), whereas legacy storage / hardware vendors who are effectively still hugging their tin will likely find future survival more than challenging.

Also, the concept of ODMs really interests me, and I can see the use of ODMs increasing further as more and more customers wake up to the fact that they have been overpaying for their data center storage for the past 10-20 years due to the historically exclusive capabilities within the storage industry. With more of a focus on a Software Defined approach, there are potentially large cost savings to be had by following the ODM approach, especially if you are an organisation of a size that would benefit from such substantial savings.

I would be glad to get your thoughts through the comments below.

 

If you are interested in the full SNIA session, a recording of the video stream is available here and I’d recommend you watch it, especially if you are in the storage industry.

 

Thanks

Chan

P.S. Slide credit goes to SNIA and TFD

VMware & DataGravity Solution – Data Management For the Digital Enterprise

 

 

Yesterday, I had the privilege of being invited to an exclusive VMware #vExpert-only webinar organised by the vExpert community manager, Corey Romero, and DataGravity, one of VMware’s partner ISVs, to get a closer look at the DataGravity solution and its integration with VMware. My initial impression was that it’s a good solution and a good match for VMware technology, and I liked what I saw. So I decided to write a quick post to share what I’ve learned.

DataGravity Introduction

The DataGravity (DG from now on) solution appears to be all about data management, and in particular data management in a virtualised data center. In a nutshell, DG aims to provide a simple, virtualisation-friendly data management solution that, amongst many other things, focuses on the following key requirements, which are of primary importance to me.

  • Data awareness – understand the different types of data available within VMs, structured or unstructured, along with various metadata about all that data. It automatically keeps track of data locations, status changes and various other metadata, including any sensitive content (i.e. credit card information), in the form of an easy-to-read, dashboard-style interface
  • Data protection & security – DG tracks sensitive data and provides a complete audit trail, including access history, to help remediate any potential loss or compromise of data

The DG solution is currently specific to VMware vSphere virtual datacenter platforms only and serves 4 key use cases.

Talking about the data visualisation itself, DG claims to provide a 360-degree view of all the data that resides within your virtualised datacenter (on VMware vSphere), and having seen the UI in the live demo, I like that visualisation, which very much resembles the interface of VMware’s own vRealize Operations.

The unified, tile-based view of all the data in your datacenter, with a vROps-like context-aware UI, makes navigating through the information about your data pretty self-explanatory.

Some of the information that DG automatically tracks on all the data that reside on the VMware datacenter include information as shown below

Some of the cool capabilities DG has when it comes to data protection include behaviour-based data protection, where it proactively monitors user and file activities and mitigates potential attacks by sensing anomalous behaviour and taking preventive measures, such as orchestrating protection points, alerting administrators and even blocking user access automatically.

During a recovery scenario, DG claims to assemble the forensic information needed to perform a quick recovery, such as catalogued files and incremental version information, user activity information and other important metadata such as the known good state of various files, enabling recovery within a few clicks.

Some Details

During the presentation, Dave Stevens (Technical Evangelist) took all the vExperts through the DG solution in some detail, including its integration with VMware vSphere, which I intend to share below for the benefit of others (sales people: feel free to skip this section and read the next).

The whole DG solution is deployed as a simple OVA into vCenter and typically requires connecting the appliance to Microsoft Active Directory (for user access tracking) as an initial, one-off task. It will then perform an automated initial discovery of data; the important thing to note here is that it DOES NOT use an agent in each VM, but simply uses VMware VADP (now known as the vSphere Storage APIs – Data Protection) to silently interrogate the data that lives inside the VMs in the data center, with minimal overhead. Some of the specifics around this overhead are as follows:

  • File indexing is done at a DiscoveryPoint (snapshot), either on a schedule or user driven (no impact on real-time access from a performance point of view)
  • Real-time access tracking overhead is minimal to non-existent
    • Real-time user activity tracking uses about 200 KB of memory
    • Network bandwidth is about 50 Kbps per VM
    • Less than 1% of CPU

From an integration point of view, while the DG solution integrates with vSphere VMs as above irrespective of the underlying storage platform, it also has the ability to integrate with specific storage vendors (licensing prerequisites apply).

Once the data discovery is complete, further discoveries are done on an incremental basis and the management UI is a simple web interface which looks pretty neat.

Similar to VMware vROPS UI for example, the whole UI is context aware so depending on what object you select, you are presented with stats in the context of the selected object(s).

The usage tracking is quite granular and keeps track of all types of user access to data in the inventory, which is handy.


 

Searching for files is simple, and you can also search using tags, which are simple binary expressions. Tags can be grouped together into profiles to search against, which looks pretty simple and efficient.
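To illustrate the idea (the tag and profile names here are hypothetical; DataGravity's actual expression syntax wasn't covered in detail during the demo): a tag is a simple true/false predicate over file metadata, and a profile groups tags that are evaluated together.

```python
# Hypothetical illustration of tag/profile searching; not DataGravity's API.
tags = {
    "is_large": lambda f: f["size_mb"] > 100,
    "is_stale": lambda f: f["days_since_access"] > 365,
}

# A profile bundles several tags to be evaluated as one search.
profiles = {"archive_candidates": ["is_large", "is_stale"]}

def search(files, profile_name):
    """Return names of files matching every tag grouped under the profile."""
    checks = [tags[t] for t in profiles[profile_name]]
    return [f["name"] for f in files if all(c(f) for c in checks)]

files = [
    {"name": "old_backup.vmdk", "size_mb": 2048, "days_since_access": 900},
    {"name": "report.docx", "size_mb": 2, "days_since_access": 30},
]
print(search(files, "archive_candidates"))  # → ['old_backup.vmdk']
```

Binary (true/false) tags like these compose cleanly, which is presumably why the profile-based search felt so simple in the demo.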

I know I’ve mentioned this already, but the simple, intuitive user interface makes consuming the information about all your data in a single-pane-of-glass manner look very attractive.

Current Limitations

There are some current limitations to be aware of however and some of the important ones include,

  • Currently it doesn’t look inside structured data files (i.e. Database files for example)
    • Covers about 600 various file types
  • File content analytics is available for Windows VMs only at present
    • Linux may follow soon?
  • VMC (VMware Cloud on AWS) & VCF (VMware Cloud Foundation) support is not there (yet)
    • Is this to be announced during a potential big event?
  • No current availability on other public cloud platforms such as AWS or Azure (yet)

 

My Thoughts

I like the solution and its capabilities for various reasons. Primarily, it’s because the focus on the data that resides in your data center is more important now than it’s ever been. Most organisations simply do not have a clue about the type of data they hold in a datacenter, typically scattered around various servers, systems, applications etc., often duplicated and, most importantly, left untracked as to its current relevance or even its actual usage and who accesses what. Often, most data generated by an organisation serves its initial purpose within a certain initial period, and that data is then simply kept on the systems forever, intentionally or unintentionally. This is a costly exercise, especially on the storage front, and you are typically filling your SAN storage with stale data. With a simple yet intelligent data management solution like DG, you now have the ability to automatically track data and its ageing across the datacenter, and use that awareness of your data to potentially move stale data onto a different, cheaper tier, such as a public cloud storage platform.

Furthermore, not having an understanding of data governance, specifically not monitoring data access across the datacenter, is another issue: many organisations do not collectively know what type of data is available where within the datacenter and how secure that data is, including its access / usage history over its existence. Data security is probably the most important topic in the industry today as organisations increasingly become digital thanks to the digital revolution / Digital Enterprise phenomenon (in other words, every organisation is now becoming digital), and a guaranteed by-product of this is more and more DATA being generated, which often includes all, if not most, of an organisation’s intellectual property. If there’s no credible data management solution focused on the security of such data, you are risking losing the livelihood of your organisation and its potential survival in a fiercely competitive global economy.

It is important to note that some regulatory compliance regimes have always enforced the use of data management & governance solutions, such as DG, tracking such information about data and its security for certain types of data platforms (i.e. PCI for credit card information…etc.). But the issue is that no such requirement existed for all the types of data that live in your datacenter. This is about to change, at least here in Europe, thanks to the European GDPR (General Data Protection Regulation), which now legally obliges every organisation to be able to provide an auditable history of all the types of data they hold, and most organisations I know do not have a credible solution covering the whole datacenter to meet such demands regarding their data today.

A simple, easily integrated solution with little overhead like DataGravity, which for the most part harnesses the capabilities of the underlying infrastructure to track and manage the data that lives on it, could be extremely attractive to many customers. Most customers out there today use VMware vSphere as their preferred virtualisation platform, and the obvious integration with vSphere will likely work in DG’s favour. I have already signed up for an NFR licence so I can download and deploy this software in my own lab to understand in detail how things work, and I will aim to publish a detailed deep-dive post on that soon. But in the meantime, I’d encourage anyone who runs a VMware vSphere based datacenter and is concerned about data management & security to check the DG solution out!!

I’m keen to get your thoughts if you are already using this in your organisation.

 

Cheers

Chan

Slide credit to VMware & DataGravity!

Storage Futures With Intel Software From #SFD12

 

As a part of the recently concluded Storage Field Day 12 (#SFD12), we traveled to one of the Intel campuses in San Jose to listen to the Intel storage software team on the future of storage from an Intel perspective. This was a great session presented by Jonathan Stern (Intel Solutions Architect) and Tony Luck (Principal Engineer), and this post summarises a few things I learnt during those sessions that I thought were quite interesting for everyone. (Prior to this session we also had a session from SNIA on the future of storage industry standards, but I think that deserves a dedicated post, so I won’t mention it here – stay tuned for a SNIA event specific post soon!)

The first session from Intel, on the future of storage, was presented by Jonathan. It’s probably fair to say Jonathan was by far the most engaging of all the SFD12 presenters, and he covered somewhat of a deep dive on Intel’s plans for storage, specifically on the software side of things. The main focus was the Intel Storage Performance Development Kit (SPDK), which Intel seems to think is going to be a key part of future storage efficiency enhancements.

The second session, with Tony, was about Intel Resource Director Technology (which addresses shared resource contention in the processor cache inside an Intel processor) which, in all honesty, is not something most of us storage or infrastructure guys need to know in detail. So my post below focuses on Jonathan's session only.

Future Of Storage

As far as Intel is concerned, there are 3 key areas when it comes to the future of storage that need to be looked at carefully.

  • Hyper-Scale Cloud
  • Hyper-Convergence
  • Non-Volatile memory

To put this into some context, see the below projections from the 2015 Wikibon Server SAN research project, comparing revenue forecasts for

  1. Traditional Enterprise storage such as SAN, NAS, DAS (Read “EMC, Dell, NetApp, HPe”)
  2. Enterprise server SAN storage (Read “Software Defined Storage OR Hyper-Converged with commodity hardware “)
  3. Hyperscale server SAN (Read “Public cloud”)

It is a known fact within the storage industry that public cloud storage platforms, underpinned by cheap commodity hardware and intelligent software, provide users with an easy to consume, easily available and, most importantly, non-CAPEX storage platform that most legacy storage vendors find hard to compete with. As such, the net new growth in global storage revenue from around 2012 has been predominantly within the public cloud (hyperscaler) space, while the rest of the storage market (non-public cloud enterprise storage) as a whole has somewhat stagnated.

This somewhat stagnated market was traditionally dominated by a few storage stalwarts such as EMC, NetApp, Dell, HPe…etc. However, the rise of server based SAN solutions, where commodity servers with local drives are combined with intelligent software to make a virtual SAN / storage pool (SDS/HCI technologies), has made matters worse for these legacy storage vendors, and such storage solutions are projected to eat further into the traditional enterprise storage landscape within the next 4 years. This is already evident in the recent popularity & growth of SDS/HCI solutions such as VMware VSAN, Nutanix, Scality and HedVig, while at the same time traditional storage vendors announce shrinking storage revenues. So much so that even some of the legacy enterprise storage vendors like EMC & HPe have come up with their own SDS / HCI offerings (EMC ViPR, HPe StoreVirtual, the announcement around a SolidFire based HCI solution…etc.) or partnered up with SDS/HCI vendors (EMC VxRail…etc.) to hedge their bets against a losing backdrop of traditional enterprise storage.

 

If you study the forecast further into the future, around 2020-2022, it is estimated that traditional enterprise storage market revenue & market share will be squeezed even further by the more rapid growth of server based SAN solutions such as SDS and HCI. (Good luck to legacy storage folks)

An estimate from EMC suggests that by 2020, all primary storage for production applications will sit on flash based drives, which coincides precisely with the timelines in the above forecast, where the growth of Enterprise server SAN storage is set to accelerate between 2019-2022. According to Intel, one of the main reasons behind this forecasted revenue growth for enterprise server SAN solutions is the development of Non-Volatile Memory Express (NVMe) based technologies, which make it possible to achieve very low latency through direct attached (read "locally attached") NVMe drives, combined with clever & efficient software designed to harness that low latency. In other words, the drop in drive access latency will make enterprise server SAN solutions more appealing to customers, who will favour Software Defined, Hyper-Converged storage solutions over external, array based storage solutions into the immediate future, and the legacy storage market will continue to shrink further and further.

I can relate to this prediction somewhat as I work for a channel partner of most of these legacy storage vendors, and I too have seen first hand the drop in legacy storage revenue from our own customers, which reasonably backs this theory.

 

Challenges?

With the increasing push for Hyper-Convergence with data locality, latency becomes an important consideration. As such, Intel's (& the rest of the storage industry's) main focus going into the future is primarily on reducing, as much as possible, the latency penalty incurred during a storage IO cycle. The imminent release of next gen storage media from Intel as a better alternative to NAND (which comes with inherent challenges such as tail latency issues that are difficult to get around) was mentioned without any specific details. I'm sure that was a reference to the Intel 3D XPoint drives (only just this week announced officially by Intel http://www.intel.com/content/www/us/en/solid-state-drives/optane-solid-state-drives-dc-p4800x-series.html) and, based on the published stats, the projected drive latencies are in the region of < 10μs (sequential IO) and < 200μs (random IO), which is super impressive compared to today's ordinary NAND based NVMe SSD drives.

This however presents a concern: the current storage software stack, which processes IO through the CPU via costly context switching, also needs to be optimised in order to truly benefit from this massive drop in drive latency. In other words, the level of dependency on the CPU for IO processing needs to be removed or minimised through clever software optimisation (the CPU has long been the main IO bottleneck due, for example, to how MSI-X interrupts are handled by the CPU during IO operations). Without this, the software induced latency would be much higher than the drive media latency during an IO processing cycle, which will still contribute to a higher overall latency. (My friend & fellow #SFD12 delegate Glenn Dekhayser described this in his blog as "the media we're working with now has become so responsive and performant that the storage doesn't want to wait for the CPU anymore!" which is very true.)


Storage Performance Development Kit (SPDK)

Some companies such as Excelero are also addressing this CPU dependency of the IO processing software stack by using NVMe drives and clever software  to offload processing from CPU to NVMe drives through technologies such as RDDA (Refer to the post I did on how Excelero is getting around this CPU dependency by reprogramming the MSI-X interrupts to not go to the CPU). SPDK is Intel’s answer to this problem and where as Excelero’s RDDA architecture primarily avoid CPU dependency by bypassing CPU for IOs, Intel SPDK minimizes the impact on CPU & Memory bus cycles during IO processing by using the user-mode for storage applications rather than the kernel mode, thereby removing the need for costly context switching and the related interrupt handling overhead. According to http://www.spdk.io/, “The bedrock of the SPDK is a user space, polled mode, asynchronous, lockless NVMe driver that provides highly parallel access to an SSD from a user space application.”

With SPDK, Intel claims that you can reach up to around 3.6 million IOPS per single Xeon CPU core before running out of PCIe lane bandwidth, which is pretty impressive. Below is an IO performance benchmark comparing standard CentOS Linux kernel IO performance (running across 2 x Xeon E5-2965 2.10 GHz CPUs, each with 18 cores, + 1-8 x Intel P3700 NVMe SSD drives) vs SPDK with a single dedicated 2.10 GHz core, allocated out of the 2 x Xeon E5-2965, for IO. You can clearly see the significantly better IO performance with SPDK which, despite having just a single core, scales IO throughput linearly with the number of NVMe SSD drives thanks to the lack of context switching and the related overhead.

(In addition to this testing, Jonathan also mentioned that they've done another test with off the shelf Supermicro hardware, and with SPDK & 2 dedicated cores for IO they were able to get 5.6 million IOPS before running out of PCIe bandwidth, which was impressive.)

 

SPDK Applications & My Thoughts

SPDK is an end-to-end reference storage architecture & a set of drivers (C libraries & executables) to be used by OEMs and ISVs when integrating disk hardware. According to Intel's SPDK introduction page, the goal of SPDK is to highlight the outstanding efficiency and performance enabled by using Intel's networking, processing and storage technologies together. SPDK is freely available as an open source product that can be downloaded through GitHub. It also provides NVMeF (NVMe over Fabric) and iSCSI servers built using the SPDK architecture, on top of the user space drivers, that are even capable of servicing disks over the network. Now this can potentially revolutionise how the storage industry builds its next generation storage platforms. Consider, for example, any SDS or even a legacy SAN manufacturer using this architecture to optimise the CPU on their next generation All Flash storage array. (Take the NetApp All Flash FAS platform for example: it is known to have a ton of software based data management services available within OnTAP that currently compete for CPU cycles with IO, and often have to scale down data management tasks during heavy IO processing. With the Intel SPDK architecture, OnTAP could free up more CPU cycles for more data management services, and even double up on various other additional services, without any impact on critical disk IO. It's all hypothetical of course, as I'm just thinking out loud here, and it would require NetApp to run OnTAP on Intel CPUs and Intel NVMe drives…etc., but it's doable & makes sense right? I mean, imagine the day where you can run "reallocate -p" during peak IO times without grinding the whole SAN to a halt? :-))
I'm probably exaggerating its potential here, but the point is that SPDK driven IO efficiencies can apply equally to all storage array manufacturers (especially all flash arrays), who can potentially start creating some super efficient, ultra low latency, NVMe drive based storage arrays that also include a ton of data management services that would previously have been too taxing on the CPU (think inline dedupe, inline compression, inline encryption, everything inline…etc.), all on 24×7 by default rather than just during off peak times, due to zero impact on disk IO.

Another great place to apply SPDK is within virtualisation, for VM IO efficiency. Using SPDK with QEMU as follows has resulted in some good IO performance for VMs:

 

I mean, imagine for example a VMware VSAN driver built using the Intel SPDK architecture, running inside user space with a dedicated CPU core performing all IO. What would the possible IO performance be? VMware currently performs IO virtualisation in the kernel, but imagine if SPDK was used and IO virtualisation for VSAN was moved to software running inside user space: would it be worth the performance gain and reduced latency? (I did ask the question, and Intel confirmed no joint engineering is currently taking place on this front between the 2 companies.) What about other VSA based HCI solutions? Take someone like Nutanix Acropolis, where Nutanix could happily re-write its IO virtualisation to happen within user space using SPDK for superior IO performance.

An Intel & Alibaba Cloud case study in which the use of SPDK was benchmarked has given the below IOPS and latency improvements:

NVMe over Fabric is also supported with SPDK, and some use cases were discussed, specifically relating to virtualisation, where VMs tend to move between hosts and a unified NVMe-oF API that talks to both local and remote NVMe drives is now available (some parts of the SPDK stack becoming available in Q2 FY17).

Using SPDK seems quite beneficial for existing NAND media based NVMe storage, but most importantly for newer generation non-NAND media, to bring the total overall latency down. However, that does mean significantly changing the architecture to process IO in user mode as opposed to kernel mode, which I presume is how almost all storage systems, Software Defined or otherwise, work today, and I am unsure whether changing them to user mode with SPDK is going to be a straightforward process. It would be good to see some joint engineering, or other storage vendors evaluating the use of SPDK, to see if the said latency & IO improvements are realistic in complex storage solution systems.

I like the fact that Intel has made SPDK open source to encourage others to freely utilise (& contribute back to) the framework, but I guess what I'm not sure about is whether it's tied to Intel NVMe drives & Intel processors.

If anyone wants to watch the recorded video of our session from #SFD12, the links are as follows:

  1. Jonathan’s session on SPDK
  2. Tony’s session on RDT

Cheers

Chan

#SFD12 #TechFieldDay @IntelStorage