Exact MSP Space Accounting on a Storwize Pool

I have blogged in the past about the classic IT story, The Cuckoo’s Egg by Clifford Stoll. It is a true story that details how Clifford discovered a hacker while trying to account for 9 seconds of mainframe processing time.

I was reminded of this recently while doing an MSP space accounting project.  MSPs (Managed Service Providers) are understandably cost focused as they try to compete with low-cost IaaS (Infrastructure as a Service) providers like Amazon.  To control costs, shared resources are normally employed, along with thin provisioning and its cousin over-provisioning.  Don’t confuse the two: thin provisioning means consuming only the exact resources needed for an objective, while over-provisioning means promising or committing more resources than you actually have, in the hope that no one calls your bluff.  You can always use thin provisioning without over-provisioning.

A Storwize pool can use both thin provisioning and over-provisioning.  As an MSP looking at pool usage, you may want to know exactly how much space each client in the shared pool is using.  Now I don’t want to burn time explaining the exact workings of thin provisioning (something that Andrew Martin explains very well here), but I do want to point out a quirk that may confuse you while trying to do space accounting.

In this example I have a Storwize pool that is 32.55 TiB in size and is showing 22.93 TiB Used.  You can clearly see we have over-allocated the 32.55 TiB of disk space by having created 75.50 TiB of virtual volumes!

[Screenshot: pool overview showing 32.55 TiB capacity, 22.93 TiB used and 75.50 TiB of virtual volumes]

Now this is significant, because if I wanted to do space accounting I would expect the Used capacity of all volumes in the pool to sum to 22.93 TiB.  In other words, if five end clients are sharing this space and I know which volumes relate to which client, I would expect the sum total of all volumes used by all clients to equal 22.93 TiB.

If I bring up the properties panel for the pool I can clearly see metrics for the pool, including the extent size (in this example 2.00 GiB; remember that, it becomes significant later).

[Screenshot: pool properties panel showing an extent size of 2.00 GiB]

Now for each thin provisioned volume I get three size properties:

Used: 768.00 KiB   
Real: 1.02 GiB   
Total: 100.00 GiB  

To explain what these are:

  • Used capacity is effectively how much data has been written to the volume (including the B-tree metadata that tracks thin space allocation).
  • Real capacity is how much space (in grains) has been pre-allocated to the volume from extents taken from the pool.
  • Total capacity is the size advertised to the hosts that can access this volume.

This means I could sum either Used capacity or Real capacity.  Since Real capacity is always larger than Used capacity, it makes more sense to sum Real capacity, especially as this is the number I am using to determine usage by clients inside a shared pool.

To get the used space of all volumes, we need to differentiate between fully provisioned (Generic) volumes and thin-provisioned volumes.

This command will grab all the Generic volumes in a specific pool (in this example called InternalPool1):

lsvdisk -bytes -delim ,  -filtervalue se_copy_count=0:mdisk_grp_name=InternalPool1

This command will grab all the thin volumes in a specific pool (in this example called InternalPool1):

lssevdiskcopy -bytes -delim , -filtervalue mdisk_grp_name=InternalPool1

Add the -nohdr option to either command if you wish to parse the output in a script, as it suppresses the header line.

So for the generic volumes we can sum the capacity field.  In this example pool, I used a spreadsheet and found it sums to 19,404,662,243,328 bytes.

So for the thin volumes we can sum the real_capacity field.  In this example pool, I used a spreadsheet and found it sums to 5,260,831,053,824 bytes.
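If you would rather not use a spreadsheet, both sums can be scripted straight against the CLI.  This is only a rough sketch: it assumes SSH key access (I am using superuser@cluster_ip purely as a placeholder) and it finds the capacity and real_capacity columns by header name rather than by position.

#!/bin/bash
# Sketch: sum fully provisioned capacity and thin real_capacity for one pool.
POOL=InternalPool1
SVC=superuser@cluster_ip   # placeholder - substitute your own user and cluster

# Generic (fully provisioned) volumes: sum the "capacity" column
ssh "$SVC" "lsvdisk -bytes -delim , -filtervalue se_copy_count=0:mdisk_grp_name=$POOL" |
  awk -F, 'NR==1 {for (i=1; i<=NF; i++) if ($i=="capacity") c=i; next} {sum+=$c}
           END {printf "Generic volumes sum to %.0f bytes\n", sum}'

# Thin volumes: sum the "real_capacity" column
ssh "$SVC" "lssevdiskcopy -bytes -delim , -filtervalue mdisk_grp_name=$POOL" |
  awk -F, 'NR==1 {for (i=1; i<=NF; i++) if ($i=="real_capacity") c=i; next} {sum+=$c}
           END {printf "Thin volumes sum to %.0f bytes\n", sum}'

Run against this example pool, it should report the same two totals as the spreadsheet.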

This brings us to a combined total of 24,665,493,297,152 bytes which is 22.43 TiB.

The problem here is obvious.   I expected to account for 22.93 TiB of space, but summing the combined total of actual capacity for full-fat volumes and real-capacity for thin volumes doesn’t add up to what I expect.  In fact in this example I am short by around 0.5 TiB of used capacity.  How do I allocate this space to a specific client if no volume owns up to using it?

I can actually spot this in the CLI as well using just the lsmdiskgrp command.  If I subtract the real capacity of 24,665,493,297,152 bytes from the total capacity of 35,787,814,993,920 bytes I get 11,122,321,696,768 bytes, which does not match the reported free capacity of 10,578,504,450,048 bytes.  The difference again reveals 543,817,246,720 bytes (0.494 TiB) of allocated space that is not showing against volumes.

IBM_Storwize:Actifio1:anthonyv>lsmdiskgrp -bytes 0
 id 0
 name InternalPool1
 status online
 mdisk_count 1
 vdisk_count 525
 capacity 35787814993920
 extent_size 2048
 free_capacity 10578504450048
 virtual_capacity 83010980413440
 used_capacity 23916077907968
 real_capacity 24665493297152

The answer is that the space is allocated to volumes, but is not being accounted for at a volume level.  If you scroll up to the second screenshot showing the pool overview, you can see the extent size is 2 GiB.  That means the minimum amount of space that gets allocated to a volume is actually 2 GiB.  But if we look at the properties of a single volume, there is no indication that the volume is holding down 2 GiB of pool space: in this example I can see only 1.02 GiB of space being claimed.  So for this example volume there is 0.98 GiB of space allocated to it which is never acknowledged as being dedicated to that volume.

[Screenshot: volume properties showing a Real capacity of 1.02 GiB]

So how do I cleanly allocate this 0.5 TiB?

I see two choices.  The first is to simply determine the shortfall, divide it by the number of thin-provisioned volumes, and then add that usage to each thin volume.  In this example I have 519 thin volumes, so if I divide 543,817,246,720 by 519 that is pretty well 1 GiB per volume that I could simply add to each volume’s space allocation (a scripted sketch of this follows below).
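Here is a rough sketch of that first approach, using the example numbers from this post (substitute your own figures from lsmdiskgrp and from the volume sums above):

#!/bin/bash
# Sketch: spread the unaccounted-for space evenly across the thin volumes.
# All values below are the example numbers from this post.
POOL_CAPACITY=35787814993920      # lsmdiskgrp capacity
FREE_CAPACITY=10578504450048      # lsmdiskgrp free_capacity
SUMMED_VOLUMES=24665493297152     # generic capacity + thin real_capacity
THIN_VOLUME_COUNT=519

POOL_USED=$((POOL_CAPACITY - FREE_CAPACITY))
SHORTFALL=$((POOL_USED - SUMMED_VOLUMES))
PER_VOLUME=$((SHORTFALL / THIN_VOLUME_COUNT))

echo "Pool used:            $POOL_USED bytes"
echo "Accounted for:        $SUMMED_VOLUMES bytes"
echo "Shortfall:            $SHORTFALL bytes"
echo "Add per thin volume:  $PER_VOLUME bytes (roughly 1 GiB)"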

The second is to accept it as a space tax and simply plan for it.  The issue is far less pronounced if the volume count is small and the volume sizes are large, and it is also far less pronounced with smaller extent sizes.  At very small extent sizes it will most likely not occur at all, or be truly trivial in size (like Clifford’s 9 seconds).  In this example, simply using 1 GiB extents would have pretty well masked the issue.  But remember that the smaller your extent size, the smaller your maximum cluster size can be: a 2 GiB extent size means the maximum cluster size is 8 PiB.

 

 


Mapping Linux RDMs to Storwize Volumes

In my previous post about MPIO software and RDMs, I suggested that SDDDSM could help you map Windows volumes to Storwize volumes.  This led to the obvious follow-up question: what about Linux VMs?

In a distant time there was a version of IBM SDD for Linux (in fact you can still download it).  But because it was closed source and shipped as compiled binaries, users could only run it on specific Linux distributions and kernel versions.  This was rather painful (especially if you upgraded your Linux version because of some other bug and then found SDD no longer worked).  Fortunately native multipathing for Linux rapidly matured and now offers a simple option that is definitely the way to go (and please don’t listen to the vendors pushing proprietary MPIO software; multipathing native to the operating system, using vendor plug-ins, is in my opinion the only acceptable MPIO solution).

Either way, it turns out you don’t even need multipath software to map a Storwize volume to an operating system device.

In this example I have created a volume on a Storwize V3700 with a UID that ends in 0043.

[Screenshot: Storwize V3700 volume details showing the UID ending in 0043]

It is mapped as a pRDM to a VM.  In the Manage Paths window I can see the same UID at the top of the window (ending in 0043).

[Screenshot: vSphere Manage Paths window showing the UID ending in 0043]

On the Linux VM that is using this RDM, I want to confirm whether the device /dev/sdb matches the pRDM.  In this example we use the smartctl command.  We can clearly see the matching Logical Unit ID (ending in 0043), so we know that /dev/sdb is indeed our pRDM.

[root@demo-oracle-4 ~]# smartctl -a /dev/sdb
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-573.3.1.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor: IBM 
Product: 2145 
Revision: 0000
User Capacity: 5,368,709,120 bytes [5.36 GB]
Logical block size: 512 bytes
Logical Unit id: 0x60050763008083020000000000000043
Serial number: 00c02020c080XX00
Device type: disk
Transport protocol: Fibre channel (FCP-2)
Local Time is: Sat Apr 16 23:16:09 2016 EDT
Device does not support SMART

Error Counter logging not supported
Device does not support Self Test logging
[root@demo-oracle-4 ~]#

If you find smartctl is not installed, then install the smartmontools package:

yum install smartmontools
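If you have several candidate devices, you can loop over them.  This is only a rough sketch: it assumes the usual /dev/sd* device naming and that you are running as root.

# Print the Logical Unit id of every /dev/sd* device so each one can be
# matched against a Storwize volume UID.
for dev in /dev/sd?; do
    id=$(smartctl -a "$dev" 2>/dev/null | awk -F': *' '/Logical Unit id/ {print $2}')
    echo "$dev ${id:-<no Logical Unit id reported>}"
done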

If we have Linux multipathing configured, we can also use the multipath -l (or -ll) command to find the UID and determine which Storwize volume is which Linux device.  Again I can easily spot that mpathb (sdb) is my Storwize volume with the UID ending in 0043.

[root@centos65 ~]# multipath -ll
mpathb(360050763008083020000000000000043) dm-6 IBM,2145
size=5G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=50 status=active
 `- 5:0:1:0 sdb 8:96 active ready running

So Linux users will actually find it quite easy to map OS disks back to the Storwize volume.

 


Do RDMs need MPIO?

I got a great question the other day regarding VMware Raw Device Mappings:

If an RDM is a direct pass-through of a volume from the storage device to the VM, does the VM need MPIO software like a physical machine does?

The short answer is NO,  it doesn’t.  But I thought I would show why this is so, and in fact why adding MPIO software may help.

First up, to test this, I created two volumes on my Storwize V3700.

[Screenshot: two new volumes created on the Storwize V3700]

I mapped them to an ESXi server as LUN ID 2 and LUN ID 3.  Note the serials of the volumes end in 0040 and 0041:

[Screenshot: host mappings showing LUN IDs 2 and 3, with serials ending in 0040 and 0041]

On ESX I did a Rescan All and discovered two new volumes, which we know match the two I just made on my V3700, as the serial numbers end in 40 and 41 and the LUN IDs are 2 and 3:

[Screenshot: ESXi storage devices after a Rescan All, showing serials ending in 40 and 41 as LUN IDs 2 and 3]

I confirmed that the new devices had multiple paths, in this example only two (one to each node canister in the Storwize V3700):

[Screenshot: ESXi path details showing two paths per device]

I then mapped them to a VM as RDMs, the first one as a Virtual RDM (vRDM) and the second as a Physical RDM (pRDM):

[Screenshot: adding the two RDMs to the VM]

Finally on the Windows VM I Scanned for New Devices and brought up  the properties of the two new disks.   Firstly you note that the first disk (Disk 1) is a VMware Virtual disk while the second disk (Disk 2) is an IBM 2145 Multi-Path disk.   This is because the first one was mapped as a vRDM, while the second was mapped as a pRDM.

[Screenshot: Windows disk properties showing a VMware Virtual disk (Disk 1) and an IBM 2145 Multi-Path disk (Disk 2)]

So here is the question: if the physical RDM is a multi-path device, does it have one path or many?  The first hint is that we only got one disk for each RDM.  But what would I see if I actually installed MPIO software?  So I installed SDDDSM and displayed the path status using the datapath query device command:

C:\Program Files\IBM\SDDDSM>datapath query device

Total Devices : 1

DEV#: 0 DEVICE NAME: Disk2 Part0 TYPE: 2145 POLICY: OPTIMIZED
SERIAL: 60050763008083020000000000000040
============================================================================
Path#    Adapter/Hard Disk          State  Mode    Select Errors
    0  Scsi Port2 Bus0/Disk2 Part0  OPEN   NORMAL      86      0

C:\Program Files\IBM\SDDDSM>

What the output above shows is that there is only one path being presented to the VM, even though we know the ESXi hypervisor can see two paths.

So this proves we didn’t actually need to install SDDDSM to manage pathing, as there is only one path being presented to the disk: the hypervisor is handling the multiple paths with its own native multipathing (using VMW_SATP_ALUA, which we can see in the ESXi pathing screen capture further up).
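If you want to confirm from the ESXi side that the hypervisor really is seeing both paths and claiming the device with VMW_SATP_ALUA, you can check from an ESXi shell.  This is just a sketch; the naa identifier below is the serial from this example, so substitute your own.

# Show the SATP and path selection policy in use for the device
esxcli storage nmp device list -d naa.60050763008083020000000000000040

# List every path the hypervisor can see to that device
esxcli storage core path list -d naa.60050763008083020000000000000040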

Having said all that, there is one advantage from the Windows VM perspective to having SDDDSM installed: I can see that Disk2 maps to the V3700 volume with a serial that ends in 40 (rather than 41).  So if I wanted to remove the vRDM volume (Disk 1), I can be confident that the volume ending in 41 is the correct one to target.

 


Evergreen Storage? Can it actually work?

Pure Storage is one of several hot flash vendors in the market right now.  Despite some negativity about their recent IPO, the IPO itself shows that the market thinks they have got their product and execution right.

One challenge for every flash vendor out there (and there are quite a few) is to be able to explain the why.  Why my product and not another vendor’s?

One thing Pure Storage promote as a strong ‘why us’ is their concept of Evergreen Storage, described here:

http://www.purestorage.com/content/dam/purestorage/pdf/datasheets/Pure_Storage_Datasheet_Evergreen.pdf

Fundamentally they are saying that as technology evolves, their modular physical design and stateless software design will allow you to upgrade components without having to move data or do any forklift upgrades.  Here is an image from their brochure:

[Image: Evergreen Storage graphic from the Pure Storage brochure]

Even with Storage vMotion, the need to move data between storage arrays remains a major additional cost of replacing or upgrading storage hardware, and the ability to minimise or eliminate this work is definitely a huge plus.

But can they actually do it?  Do we have working examples of other vendors achieving this?

There is actually a good working model of a product that has done exactly this since 2003: the IBM SAN Volume Controller.  When IBM released the SVC in 2003, the first model (the 4F2) had only 4 GB of RAM per node with 2 Gbps FC adapters.  Since then, IBM has released a succession of new models as Intel hardware has evolved, with the current nodes having at least 32 GB of RAM, dramatically more cores, and optional 16 Gbps FC adapters!

The neat thing is that clients who invested in licensing in 2003 have been able to upgrade their nodes, with data in place, over successive years.  The cost of new nodes has been relatively low compared to the performance and functional benefits that each release has provided.  So I know for a fact that this idea of an Evergreen storage product is not only possible, but has been positively demonstrated by IBM.

The challenge for any vendor trying to do this is threefold:

  1. The technology really has to support seamless upgrades.   While the IBM SVC certainly did and does, there were some minor hiccups along the way.   One example was that first model, the 4F2, which could not support the later 64-bit firmware releases; if you held off upgrading for too long, moving to new hardware needed some special help or a double hop to get the upgrade going.    Another example is bad racking:   if the nodes were racked and stacked badly, pulling one node out could disturb its partner node (something I have sadly seen).
  2. The vendor needs to remain committed to the product.   While I laud IBM’s success with the SVC (now going even stronger with its Storwize brothers),  a sister product released at the same time, the Storage File System (sometimes called Storage Tank), did not get market traction and did not progress very far before being replaced by GPFS (which was not exactly a one-for-one replacement).  And while the DS8000 continues going strong (long after Chuck Hollis, in a classic piece of EMC FUD,  declared it dead),  its little sister, the DS6800, truly was dead within months of being released.   Its early months were so drama-laden (it was sometimes sadly referred to as a crit-sit in a box) that new models were never released, which was equally sad, as once the code stabilised it became a great product.
  3. The vendor needs to hang around.   This one seems fairly obvious.   Clearly if someone were to buy Pure Storage (if the structure of the company allowed someone to do this), the buyer would also need to support this strategy.

So can Pure Storage do it?   Only time will tell, but they have made a great start and the industry has shown the concept is possible.   I will watch their progress with great interest!

 


vSphere ESXi 6.0 CBT (VADP) bug that affects incremental backups / snapshots

VMware recently posted a new KB article, 2136854, to document a new issue that has been found with their Changed Block Tracking (CBT) code.

It’s important to note that this is not the same issue as the one also posted recently for ESXi 6.0 (KB 2114076), which is now fixed in a re-issued build of ESXi 6.0 (Build 2715440).

But it is very similar to KB 2090639 from a historical perspective.

The Issue

If you are using a product that leverages VMware’s VADP for backup, then chances are you are using it not just for initial fulls, but for regular incremental snapshots (for backup purposes).  There are numerous products on the market that leverage this API; it is virtually the industry standard to use this feature, as it results in faster backups.

When the incremental changes are requested through the API (QueryChangedDiskAreas), the API returns the changed blocks, but unfortunately some of the changed blocks aren’t being reported correctly in the first place, so backup data is essentially missing.  Backups based on this can be inconsistent when recovered and result in all sorts of problems.

The Challenge

Currently there is no resolution or hotfix for the issue from VMware.  I hope that we will see something in the coming days, given the wide-ranging impact on customers and the partner products affected.

The Workarounds

The KB suggests these workarounds:

  1. Do a full backup every time, which will certainly work, but it’s not really a viable fix for most customers (ouch!)
  2. Downgrade to ESXi 5.5 and take the virtual hardware back to version 10 (ouch!)
  3. Shut down the VM before doing an incremental (ouch!)

From the testing we have done at Actifio, option 3 doesn’t actually provide a workaround either, and options 1 & 2 aren’t really ideal.

The Discovery

When Actifio Customer Success Engineers discovered the issue, we contacted VMware and proved the problem using just API calls to demonstrate exactly where it was. How did we discover the issue, I hear you ask?  We found it via our patented fingerprinting feature that runs after every backup job. This feature has essentially learnt not to trust the data we receive (history has proven it useful many times) but to verify it against our copy and the original source copy. If we detect a variance of any kind, we trigger an immediate full read-compare against the source and update our copy. This works like a full backup job, but doesn’t write out a complete copy again; it just updates our copy to line up with the source (as we like to save disk where we can!). We’ve seen this occur from time to time with our many different capture techniques (not just VADP), so it’s a worthy bit of code, to say the least, that our customers benefit from.

Let’s hope there’s a hotfix on the near horizon, so the many vendor products that rely on VADP / CBT can get back to doing what we do best: protecting critical data for our customers in a way that can be recovered without question.

—-

Thanks to Jeff O’Connor for writing this up.   You can find his blog here:  http://copydata.tips


Accessing the Instrumentation

Here are some rather bad photos of my 1972 Holden HQ Kingswood Premier, one of my first ever cars (and one that I sadly no longer own):

[Photos: my 1972 Holden HQ Kingswood Premier]

This was the V8 Four Litre model (actually 253 cubic inches, often jokingly described as having all the power of a 6 cylinder with the fuel economy of an 8).   The engine bay was so huge and empty I could open the bonnet and sit on the side of the car with my feet comfortably inside the bay while I changed spark plugs or cleaned the points.

The Kingswood was not what you would call an instrumented vehicle.   The dashboard had a speedo, a fuel gauge and three lights:   Temperature, Oil and Charging.   I dubbed these three lights the idiot lights, because once they came on, you were the idiot (sorry, no picture; this was the 1980s).

Modern storage infrastructure by comparison is slightly more instrumented.   A vast array of metrics are tracked and these can be used to perform all sorts of analysis.   Analysis like:

  • Are my hosts getting good response times?
  • Are specific disks or arrays being overworked?
  • Are my fibre ports being used in a balanced fashion?

So can you do this with the Storwize products?    Of course!   I documented the built-in tools here (where I talked about the Performance GUI):

https://aussiestorageblog.wordpress.com/2011/12/06/svc-and-storwize-v7000-release-6-3-performance-monitor-panel/

And here (where I talked about the performance CLI):

https://aussiestorageblog.wordpress.com/2012/06/22/storwize-v7000-and-svc-performance-monitoring/

But these tools have only limited usefulness.   They are not granular, in that you cannot look at specific hosts or specific arrays or specific FC ports (meaning the three analysis ideas I suggested above are not even possible).   So how can we do this analysis?

The good news is that Storwize products do track all the metrics needed to do very granular analysis, and these are freely accessible.   The statistics files are documented by IBM; here is a fairly old page that documents some of them:

http://www-01.ibm.com/support/docview.wss?uid=ssg1S1003597
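As a rough sketch of where the raw numbers live (assuming a reasonably current code level, and using superuser@cluster_ip purely as a placeholder): each node writes per-interval XML statistics files into /dumps/iostats, and you can list them and copy them off for whatever analysis tool you prefer.

# List the per-interval XML statistics files on the config node
ssh superuser@cluster_ip 'lsdumps -prefix /dumps/iostats'

# Copy them off for analysis (stats held on the other nodes can be gathered
# onto the config node first with cpdumps)
scp "superuser@cluster_ip:/dumps/iostats/*_stats_*" ./iostats/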

But how do you turn these into something useful?   There used to be a tool called svcmon, but that tool appears to have been killed off, as per this rather sad blog post:

https://www.ibm.com/developerworks/community/blogs/svcmon?lang=en

There is another IBM Community developed tool called qperf which you can access using the link below:

http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD105947

With a graphing tool here:

http://www-01.ibm.com/support/docview.wss?uid=tss1td106040

And another tool here:  http://www.stor2rrd.com/

And yet another one here!   https://code.google.com/p/svc-perf/

The challenge for many of these tools is that they require manual setup, usually have a limited database engine, and don’t always make analysis easy or simple.

You can of course use IBM’s TPC:

http://www-03.ibm.com/software/products/en/tivostorprodcent

You could also consider Intellimagic.   Although I have not looked too deeply at this one, these guys wrote IBM’s Disk Magic tool, so they certainly understand storage performance:

https://www.intellimagic.com/solutions/san-storage-and-fabric/intellimagic-vision

The challenge for all storage admins is that they are not always experts at diagnosing performance issues.    Getting some genuine examples of the thinking process and the flow of getting from problem to solution is vital.   This makes BVQ another good choice.

To see an example of how instrumented data presented in a graphical format can be used to generate a useful problem analysis, check out this blog post here:

https://bvqwiki.sva.de/pages/viewpage.action?pageId=28180509

and another one here:

https://bvqwiki.sva.de/display/BVQ/Performance+bottleneck+analysis+on+IBM+SVC+and+IBM+Storwize+V7000

I really like these posts for two reasons:

  1. They clearly show just how instrumented the product is.
  2. They clearly show how using this data in a graphical format can lead to good and quick root cause analysis.

Also have a look at some of these videos:

https://bvqwiki.sva.de/pages/viewpage.action?pageId=3834329

So how are you instrumenting your Storwize?
What do you find the easiest tool to use?

 

 

 


DevOps Culture lessons for all of us?

A colleague of mine recently pointed me at this fascinating document:

https://puppetlabs.com/2015-devops-report

It seems everyone is talking about DevOps right now, and if we accept even half of what this report tells us, to ignore what it is saying would be an opportunity lost.   In essence the report suggests:

High-performing IT organizations experience 60 times fewer failures and recover from failure 168 times faster than their lower-performing peers. They also deploy 30 times more frequently with 200 times shorter lead times.

Can we believe these numbers?   There is a risk that the measured results don’t totally equate to the benefits of DevOps adoption.   After all, many of the early adopters in this space were in fast-growing or new-market sectors.   They were probably going to grow rapidly regardless, simply because they were already innovating in new areas.    However, what the numbers do clearly show is that if you can achieve high rates of change with lower levels of risk, you can adapt faster to market needs and customer demands.   And that is what would provide you with competitive advantage.

I struck a simple example of speed to market just days ago, after upgrading my iPhone to iOS 9 and then finding that my bank’s iPhone app kept crashing.  They had missed the upgrade boat, so to speak, and took a week to catch up.   In some ways I should be pleased it was only a week, but in today’s economy, seven days is a lifetime.

However, the part of the report that really struck a chord for me was the section titled Why Culture Matters.   EVERYONE in every company should read this section.   Print out these tables.  Tape them to the desk or wall.   Bring them to meetings.   Reflect loudly: what kind of manager do you have?   What kind of culture is your management team engendering?

[Image: culture table from the Why Culture Matters section of the report]

More importantly, are these strategies being followed?

[Image: second table from the Why Culture Matters section of the report]

On the same day, I read/listened to this:

http://www.abc.net.au/radionational/programs/saturdayextra/small-changes-to-workplace-culture-can-reap-big-dividends/6810108

The fundamental message is that providing employees with places to gather and chat informally can generate huge benefits.   Can this even occur in companies that don’t even provide their workers with tea and coffee?

I finished my day reading the DevOpsGuys blog.   I loved this discussion of technical debt (and the blog from Box referenced in the comments).   How many organizations out there are burdened by technical debt?

Don’t know what I mean?   Read the blog:

http://blog.devopsguys.com/2015/07/31/devops-and-automating-the-repayment-of-technical-debt/

 
