The quest for perfect knowledge doesn’t start with a screen capture

One of the fun parts of my job is problem solving…  I won’t lie… I love it.

Step one in problem solving is always the same:  define the problem.
Step two:  get the data needed to solve the problem.
Step three:  solve it!

Simple, right?

Wrong.

One of the reasons IT gets it wrong again and again is simple:  the assumption of perfect knowledge.   We assume that with one sentence, or even worse, one screen capture, we have described the problem with enough depth that it can now be solved.   That the team now perfectly understands the problem and that the solution they supply will be… wait for it…  you guessed it…  perfect!

Don’t get me wrong, I love screen captures (using my favourite tool, Snagit).   In fact, screen captures are my number one tool for writing documentation.  When I worked on IBM Redbooks (one of IBM’s greatest free gifts to the IT community) I often found some chapters were more picture than text… and that was OK.   People need to see what it is you are talking about.

But when it comes to describing a problem, even if a picture is worth a thousand words, a screen capture can be the devil itself.   The issue with screen captures is simple:  they contain information that cannot easily be searched or indexed (except with your eyeball).   They may show the problem, or just barely validate that it exists, but they rarely help in SOLVING it.

Last week I got my favourite kind of screen capture: the one taken of a screen with a phone (with the reflection of the photographer clearly visible in the shot).   Apart from giving me the ability to rate that person’s fashion sense, these kinds of shots are among the worst.   Amusingly, when I asked why I didn’t also get logs, I was told the customer’s security standards would not allow logs to be sent.   Yeah right… this is the same customer who doesn’t mind you standing in the middle of their computer room taking photos of their displays with your phone?

So the next time you plan on sending a screen capture, stop for a minute and consider…  is this enough for a perfect solution?   Are there no logs I can send along with this picture?  Has the vendor supplied a tool I can use to offload data?   Or, even better, to send it automatically?    Am I doing anything more than just describing the problem itself?


Shellshock and IBM SVC and Storwize products

While blogging last week about how various vendors have responded to the Shellshock exploit, I noted that several vendors, notably Oracle and Cisco, were open about products for which they did not yet have a fix.     IBM, meanwhile, appears to announce a vulnerability only after it has a fix.   In other words, vulnerable customers are left without formal notification that they are exposed, and without any workarounds, until a fix is actually available.   I am left slightly annoyed by this policy.

The formal notification for the Storwize family and IBM SVC family came out here on October 11, 2014.  At the time of writing these are the fix levels:

Remediation/Fixes
IBM recommends that you fix this vulnerability by upgrading affected versions of IBM SAN Volume Controller, IBM Storwize V7000, V5000, V3700 and V3500 to the following code levels or higher:

7.1.0.11
7.2.0.9
7.3.0.7

More importantly, the notification contains this critical piece of information:

Vulnerability Details

The following vulnerabilities are only exploitable by users who already have authenticated access to the system.

In other words, the best way to manage exposure is to limit the number of users who have CLI access, and to use network controls (such as ACLs and firewalls) to restrict network access to your devices.
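
If you want to audit that exposure, a quick check (assuming a reasonably current SVC or Storwize code level) is to list the defined users and the user groups that control their roles from the CLI:

lsuser
lsusergrp

Any accounts that are no longer needed can then be removed, and the remainder kept to the lowest role that still lets them do their job.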

So kudos to IBM for creating fixed versions; I just wish the acknowledgement and remediation advice had been published earlier.

 


Vale Randall Davis

I received some very sad news last week that Randall Davis has passed away.

Randall was a very experienced and capable IT professional based in Melbourne, Australia. He worked for IBM for many years, co-authored several IBM Redbooks and fathered two wonderful boys with his wife Fiona.

Randall’s funeral will be held in the Federation Chapel, Lilydale Memorial Park, 126-128 Victoria Rd, Lilydale on Wednesday Oct. 8, 2014, commencing at 11.15 am.

If you knew Randall and wish to pay your respects, then please attend.


Shell shocked by binary explanations

On September 24, 2014, a set of new exploits was revealed that can be used to gain unauthorized access to Unix-based systems running the bash shell.   Known collectively as Shellshock, they have caused tremendous consternation and activity in the IT industry.

What has proven interesting is the way each major vendor has chosen to respond to this issue. An enormous number of products, whether software, hardware or appliance, are affected.  You could almost safely assume that if a product can be accessed with a Unix-like shell, then it is quite likely going to need patching, once the relevant vendor has released a fix.
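
If you do have shell access to a system, the original CVE-2014-6271 flaw is easy to test for locally. The one-liner below is only a quick check (it does not cover the follow-on CVEs such as CVE-2014-7169), but if the installed bash is exposed it will print the word vulnerable:

env x='() { :;}; echo vulnerable' bash -c "echo this is a test"

A patched bash simply echoes the test string (and may print a warning about the attempted function definition).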

But how can you know?

The best way is clearly a statement from the vendor in question, and this is where things get interesting.    Some vendors have taken the attitude that they will only admit to a vulnerability once they have a fix for it.

Ideally each vendor should post a list of:

  • Products that are not vulnerable
  • Products that are vulnerable but a fix is available
  • Products that are vulnerable and no fix is available (yet)
  • Products that may be vulnerable but testing is still in progress

This IBM website here happily lists unaffected products, but gives no guidance as to which products are affected.

The DS8000 has a page here detailing available fixes, but its stablemates the Storwize V7000, V3700 and V5000 (and the SVC) are almost certainly affected too, yet there is not a peep on the internet from IBM about them.    I presume this is because a fix is being written but is not yet available.

Oracle have a great page here which has four sections with titles like:

  • 1.0 Oracle products that are likely vulnerable to CVE-2014-7169 and have fixes currently available
  • 2.0 Oracle products that are likely vulnerable to CVE-2014-7169 but for which no fixes are yet available
  • 3.0 Products That Do Not Include Bash
  • 4.0 Products under investigation for use of Bash

Cisco have a great page here with a very similar set of information with sections like:

  • Affected Products
  • Vulnerable Products
  • Products Confirmed Not Vulnerable

EMC have a page here, but as usual they make it hard for us common people by putting it behind an authentication wall.


Actifio Copy Data Forum – September 17 in Sydney

It’s been a few weeks between posts, for the simple reason that I have been quite busy at Actifio!   One thing that is keeping me busy is an event we are running in Sydney next week, and it would be great to see you there.

You will hear from leading organizations like Westpac NZ, NSW Ambulance Service, and other Actifio customers, and learn how they have transformed their data management with Actifio Copy Data Virtualisation solutions.

At Copy Data Forum 2014, you will learn more about the proven business impact of Actifio, including:

  • Improved Resiliency – through instant data access for data protection and disaster recovery.
  • Enhanced Agility – putting data where you need it, when you need it.
  • Transition to the Cloud – ensuring your data follows your applications wherever they live, including public, private, and hybrid cloud-based systems.
  • Dramatic Savings – up to 90% reduction in storage costs and up to 70% reduction in network bandwidth.

Sound interesting?     Register Now


Tortured by Tenders? What’s the problem?

Many Australian organizations, both government and private enterprise, acquire IT technology through a tender process.

No, not that kind of tender…

Love Me Tenders

More like this kind of tender (anyone want a bridge shaped like a coat-hanger?).

The process of creating and responding to a tender actually involves what I call the three tortures:

  • The torture of creating the tender request document.   I have never met a client who enjoys the creation process.   Many resort to paying third parties to create them.
  • The torture of responding to a tender.  I have never met a business partner or vendor who enjoys responding to one!
  • Then the final torture: reading all those tender responses and selecting the winning one.   I have never personally experienced this torture, but I can imagine how hard it must be reading all those vendor weasel words.

I see five fundamental issues around tenders (well, in Australia anyway).

1)  The lawyers are writing most of them (and it’s not helping one bit)

Most tender requests contain a huge amount of legal documentation.   Often less than 10% of the words in the published documentation relate to technical or (more importantly) business requirements.
Quite seriously, they often include 70-100 pages of legalese and 5-10 pages of truly useful back story as to why the tender has been released at all.
I am certain that every tender response needs to sit inside a legal framework of responsibilities, but I have seen no evidence that all of this legalese has prevented failed projects or bad solutions.

2)  Repetition in questions

I cannot overstate how bad this situation is.   I have repeatedly seen tender documents that ask the same questions again and again and again (and again).

Even worse, I see questions that are clearly not finished, or questions that are missing huge amounts of obvious (and necessary) subtext or back story.  I have seen tenders where the volume of question-and-answer documents created after the tender was published (as vendor questions were responded to) exceeded the technical detail provided in the initial documentation.  Quite frankly, that’s just astonishing (in a bad way).

It seems that different teams each contribute to the total documentation, and the person who compiles and publishes the document has no inkling how much repetition has occurred in the process.   I don’t blame the authors – I blame the project manager who compiles their contributions and the timelines under which these tender documents are created.   Indeed, my gut feeling is that management simply doesn’t give the authors anywhere near enough time or resources to do a good job.

3)  Vendor bias

When you see a tender that asks for SRDF (as opposed to sync or async replication) you know there is a serious (EMC-focused) bias.   Asking for semi-sync replication is nearly as bad (that marks it as a NetApp-focused tender).
Many tenders are written with a specific outcome in mind, but all this leads to is weasel words, as all the other respondents attempt to use their hammers to batter their products into the shape needed to answer the questions.

The issue is that the tender should really be about business outcomes enabled by IT, not IT solutions that someone thinks will lead to the best IT outcomes (and by implication, maybe, hopefully, the right business outcomes).

The idea of accepting that truly differentiated vendors will help you achieve better outcomes with differentiated technology simply doesn’t fit a straight question-and-answer response document.   The Q&A method only suits the accountants trying to score the responses.   But don’t worry, all of that is handled next…

4)  No connection between technical requirements and financial reality

I have no issue with every organisation trying to get the best value for its money and the best possible outcome from every major technical rework.
But if you want 2 petabytes of Tier 1 disk and your budget is $100K, you are not going to get it.
Frankly, most IT departments know full well what their maximum budget is, but if all but one tender response gets knocked back within an hour of submission because they all missed an unstated financial cut-off, you have to question the efficacy of the whole process.   Invariably, at least one vendor with the right contacts knows the ‘win price’; everyone else was drifting off in the clouds.

Throw me a frickin bone

5)  My final gripe: nowhere near enough lead time to get the responses written.   I routinely see tenders talked about for months and then released with less than three weeks to create the responses.  This reflects the overall problem with timing… everyone is simply too busy, but the end result is rushed bids.

So do you have a better perspective on what’s going on?  Feel free to share!


Using AIX VG mirroring in combination with hardware snapshots

One of the great things about Logical Volume Managers is how you can use them for all manner of clever solutions.   I recently explored how to use a combination of hardware snapshots and LVM to create rapid backups without using backup software (or as a source for a data protection product).

To do this we need to do the following:

  1. We need to present a staging disk to the host, large enough to hold the data we are trying to protect.   In this example that is a volume group (VG) being used to hold DB2 data.  This disk could come from a different primary storage device (e.g. an XIV or a Storwize V7000) or could be an Actifio-presented disk.   You need to check whether your multi-pathing software will work with that disk.
  2. We mirror our datavg onto our new staging disk using AIX VG mirroring.
  3. We take a hardware snapshot of that disk.
  4. We now allow the VG mirror to become stale to remove disk load on the host.
  5. Prior to taking the next snapshot, we get the mirrors back in sync again.  (The whole cycle is sketched just below; each step is covered in detail in the rest of this post.)
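
As an orientation, one protection cycle boils down to the commands covered in detail below (a sketch only, using the example VG and copy numbers from this post):

joinvg db2vg        # bring the staging copy back into the mirror and let it resync
# wait until the mirror is fully synced (the checkvg.sh script below does this check),
# quiesce the application, take the hardware snapshot, then resume the application
splitvg -c2 db2vg   # split the staging copy off again so it can go stale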

This process clearly depends on whether you would prefer to leave the two copies in sync or let them go stale.   The advantage of letting them go stale is that the disk I/O workload needed to keep them in sync is avoided.  While you will need to catch up later, the total effort to do this may well be significantly less than the continual effort of mirroring them.

Example configuration

We have a VG (called db2vg) with one copy.  We know only one copy exists because each logical volume in the volume group has only one PV.

[AIX_LPAR_5:root] / > lsvg -l db2vg
db2vg:
LV NAME             TYPE       LPs     PPs     PVs  LV STATE      MOUNT POINT
jfsdb2log1          jfs2log    1       1       1    open/syncd    N/A
jfsdb2log2          jfs2log    1       1       1    open/syncd    N/A
jfsdb2log3          jfs2log    1       1       1    open/syncd    N/A
db2binlv            jfs2       14      14      1    open/syncd    /db2
db2loglv            jfs2       10      10      1    open/syncd    /db2log
db2datalv           jfs2       40      40      1    open/syncd    /db2data

If I display the detailed view of the relevant VG, I can see the VG is currently in a good state:

[AIX_LPAR_5:root] / > lsvg -L db2vg
VOLUME GROUP:       db2vg                    VG IDENTIFIER: 00f771ac00004c0000000144bf115a1e
VG STATE:           active                   PP SIZE:        512 megabyte(s)
VG PERMISSION:      read/write               TOTAL PPs:      71 (36352 megabytes)
MAX LVs:            512                      FREE PPs:       4 (2048 megabytes)
LVs:                6                        USED PPs:       67 (34304 megabytes)
OPEN LVs:           6                        QUORUM:         2 (Enabled)
TOTAL PVs:          1                        VG DESCRIPTORS: 2
STALE PVs:          0                       STALE PPs:      0
ACTIVE PVs:         1                        AUTO ON:        yes
MAX PPs per VG:     130048
MAX PPs per PV:     1016                     MAX PVs:        128
LTG size (Dynamic): 512 kilobyte(s)          AUTO SYNC:      no
HOT SPARE:          no                       BB POLICY:      relocatable
PV RESTRICTION:     none                     INFINITE RETRY: no

We have added one new disk to the server.  We know it’s not in use because it has no VG (it says none).

[AIX_LPAR_5:root] / > lspv
hdisk0          00f771acd7988621                    None
hdisk5          00f771acbf1159f6                    db2vg           active
hdisk6          00f771ac41353d73                    rootvg          active
[AIX_LPAR_5:root] / > lsdev -Cc disk
hdisk0 Available C9-T1-01 MPIO IBM 2076 FC Disk
hdisk5 Available C9-T1-01 MPIO IBM 2076 FC Disk
hdisk6 Available C9-T1-01 MPIO IBM 2076 FC Disk

We extend the VG onto the new staging disk and then mirror it. We specify the VG name (db2vg) and the name of the unused or free disk (hdisk0).

It takes a while, so we run the mirrorvg command as a background task with &:

[AIX_LPAR_5:root] / > extendvg db2vg hdisk0
[AIX_LPAR_5:root] / > mirrorvg db2vg hdisk0 &
0516-1804 chvg: The quorum change takes effect immediately.

We monitor the mirroring with a script.  I did not write this script but did modify it.   The original author (W.M. Duszyk) should thus be acknowledged!   Also thanks to Chris Gibson for help with this.

#!/usr/bin/ksh93
### W.M. Duszyk, 3/2/12
### AVandewerdt 01/05/14
### show percentage of re-mirrored PPs in a volume group
 [[ $# < 1 ]] && { print "Usage: $0 vg_name"; exit 1; }
vg=$1
printf "Volume Group $vg has ";lsvg -L $vg | grep 'ACTIVE PVs:' | awk '{printf $3}';printf " copies "
Stale=`lsvg -L $vg | grep 'STALE PPs:' | awk '{print $6}'`
[[ $Stale = 0 ]] && { print "and is fully mirrored."; exit 2; }
Total=`lsvg -L $vg | grep 'TOTAL PPs:' | awk '{print $6}'`
PercDone=$(( 100 - $(( $(( Stale * 50.0 )) / $Total )) ))
echo "and is mirrored $PercDone%."
exit 0

We can use this script to check if the VG is in sync.   You run the script and specify the name of the VG:

 [AIX_LPAR_5:root] / >./checkvg.sh db2vg
Volume Group db2vg has 2 copies and is mirrored 85%.

We wait for it to reach 100%

 [AIX_LPAR_5:root] / > ./checkvg.sh db2vg
Volume Group db2vg has 2 copies and is fully mirrored.
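
If you would rather not keep re-running the script by hand, a simple polling loop along these lines will wait for the mirror to complete (a rough sketch, assuming checkvg.sh is in the current directory):

# poll every 60 seconds until checkvg.sh reports the VG is fully mirrored
until ./checkvg.sh db2vg | grep -q "fully mirrored"
do
  sleep 60
done
echo "db2vg is now fully mirrored"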

If you want to see the exact state of the VG, let’s look at the volume group details.   Note how each LV now has two PPs for every LP and the LV state is open/syncd.  An LV state of closed/syncd is not an issue if the LV is actually raw (rather than holding a file system) and it is not being used by the application.

[AIX_LPAR_5:root] / > lsvg -l db2vg
db2vg:
LV NAME             TYPE       LPs     PPs     PVs  LV STATE      MOUNT POINT
jfsdb2log1          jfs2log    1       2       2    open/syncd    N/A
jfsdb2log2          jfs2log    1       2       2    open/syncd    N/A
jfsdb2log3          jfs2log    1       2       2    open/syncd    N/A
db2binlv            jfs2       14      28      2    open/syncd    /db2
db2loglv            jfs2       14      28      2    open/syncd    /db2log
db2datalv           jfs2       40      80      2    open/syncd    /db2data

Now display that LV.   We can see hdisk0 is copy 2 (PV 2).  This is good.

[AIX_LPAR_5:root] / > lslv -m db2binlv
db2binlv:/db2
LP    PP1  PV1               PP2  PV2               PP3  PV3
0001  0002 hdisk5            0002 hdisk0
0002  0003 hdisk5            0003 hdisk0
0003  0004 hdisk5            0004 hdisk0
0004  0005 hdisk5            0005 hdisk0
0005  0006 hdisk5            0006 hdisk0
0006  0007 hdisk5            0007 hdisk0
0007  0008 hdisk5            0008 hdisk0
0008  0009 hdisk5            0009 hdisk0
0009  0010 hdisk5            0010 hdisk0
0010  0011 hdisk5            0011 hdisk0
0011  0012 hdisk5            0012 hdisk0
0012  0013 hdisk5            0013 hdisk0
0013  0014 hdisk5            0014 hdisk0
0014  0015 hdisk5            0015 hdisk0

We are now ready to snapshot the staging disk to preserve its state while it is in sync.  Once the snapshot is created, we can let the mirror go stale so that there is no disk load to keep the staging disk in sync.  You should coordinate this snapshot with the application writing to the disk.   With Actifio we do this with the Actifio Connector software.

Once the snapshot is taken we can split the VG to stop the workload of mirroring.   We are going to split off copy 2, which is the copy on our staging disk (hdisk0):

splitvg -c2 db2vg

The new copy is called vg00.  You can force AIX to use a different name.

[AIX_LPAR_5:root] / > splitvg -c2 db2vg
[AIX_LPAR_5:root] / > lsvg
db2vg
rootvg
vg00

If we check db2vg we can see it still shows two PPs per LP, but in fact we are no longer keeping the second copy (on hdisk0) in sync.

[AIX_LPAR_5:root] / > lsvg -l db2vg
db2vg:
LV NAME             TYPE       LPs     PPs     PVs  LV STATE      MOUNT POINT
jfsdb2log1          jfs2log    1       2       2    open/syncd    N/A
jfsdb2log2          jfs2log    1       2       2    open/syncd    N/A
jfsdb2log3          jfs2log    1       2       2    open/syncd    N/A
db2binlv            jfs2       14      28      2    open/syncd    /db2
db2loglv            jfs2       10      20      2    open/syncd    /db2log
db2datalv           jfs2       40      80      2    open/syncd    /db2data

When we look at our newly created VG (vg00) it does not have 2 copies.

[AIX_LPAR_5:root] / > lsvg -l vg00
vg00:
LV NAME             TYPE       LPs     PPs     PVs  LV STATE      MOUNT POINT
fsjfsdb2log1        jfs2log    1       1       1    closed/syncd  N/A
fsjfsdb2log2        jfs2log    1       1       1    closed/syncd  N/A
fsjfsdb2log3        jfs2log    1       1       1    closed/syncd  N/A
fsdb2binlv          jfs2       14      14      1    closed/syncd  /fs/db2
fsdb2loglv          jfs2       10      10      1    closed/syncd  /fs/db2log
fsdb2datalv         jfs2       40      40      1    closed/syncd  /fs/db2data

Curiously, while the LVs still show as being in sync, the mirror is actually already stale by 3 PPs:

[AIX_LPAR_5:root] / > chmod 755 checkvg.sh;./checkvg.sh db2vg
Volume Group db2vg has 1 copies and is mirrored 99%.

I generate some change by copying some files to /db2data to increase this difference.   Of course if DB2 is really running then changes will start occurring straight away.

[AIX_LPAR_5:root] / > ./checkvg.sh db2vg
Volume Group db2vg has 1 copies and is mirrored 97%.

If we check the state of the LVs we can see that this file I/O has created stale partitions. This is not a problem.   The speed with which partitions become stale will depend on the size of the PPs and the address range locality of the typical I/Os generated between snapshots.

[AIX_LPAR_5:root] / > lsvg -l db2vg
db2vg:
LV NAME             TYPE       LPs     PPs     PVs  LV STATE      MOUNT POINT
jfsdb2log1          jfs2log    1       2       2    open/stale    N/A
jfsdb2log2          jfs2log    1       2       2    open/stale    N/A
jfsdb2log3          jfs2log    1       2       2    open/stale    N/A
db2binlv            jfs2       14      28      2    open/stale    /db2
db2loglv            jfs2       10      20      2    open/stale    /db2log
db2datalv           jfs2       40      80      2    open/stale    /db2data
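
You can also see how far the staging copy has drifted by asking lspv for the stale partition count on the staging disk itself (assuming hdisk0 is still the staging disk):

lspv hdisk0 | grep -i stale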

When we are ready to take the next snapshot we need to get the two copies back together and in sync.   To do this we rejoin the two with this command:

joinvg db2vg

We can see the two start coming back to sync:

[AIX_LPAR_5:root] / > ./checkvg.sh db2vg
Volume Group db2vg has 2 copies and is mirrored 98%.

As the two copies get back into sync, we can clearly see it because the LV state returns to syncd rather than stale (in the output below db2datalv is still catching up).

[AIX_LPAR_5:root] / > lsvg -l db2vg
 db2vg:
 LV NAME             TYPE       LPs     PPs     PVs  LV STATE      MOUNT POINT
 jfsdb2log1          jfs2log    1       2       2    open/syncd    N/A
 jfsdb2log2          jfs2log    1       2       2    open/syncd    N/A
 jfsdb2log3          jfs2log    1       2       2    open/syncd    N/A
 db2binlv            jfs2       14      28      2    open/syncd    /db2
 db2loglv            jfs2       10      20      2    open/syncd    /db2log
 db2datalv           jfs2       40      80      2    open/stale    /db2data

If the resync does not occur, we can force it with the syncvg command:

syncvg -v db2vg

Once we are in sync, we can do another snapshot of the staging disk.

Issues with scripting this:

One thing you may want to do is allow a non-root user to perform these commands.  So, for instance, if we want to allow the DB2 user (in this example db2inst2) to execute the splitvg and joinvg commands, we can use sudo.

  1. Download and install sudo on the AIX host
  2. Issue this command to edit the sudo config file:   visudo
  3. Add this line:
    db2inst2 ALL = NOPASSWD: /usr/sbin/joinvg,/usr/sbin/splitvg

Log on as the DB2 user and check that it worked:

[AIX_LPAR_5:db2inst2] /home/db2inst2 > sudo -l
User db2inst2 may run the following commands on this host:
(root) NOPASSWD: /usr/sbin/joinvg
(root) NOPASSWD: /usr/sbin/splitvg

Using the snapshot with a backup host

One strategy that can be used in combination with this method is to present the snapshot to a server running backup software.  The advantage of doing this is that the backup can effectively be done off-host.   The disadvantage is that each backup will be a full backup unless the backup software can scan the disk for changed files or blocks.

Import the VG

To use the snapshot, connect to the management interface of the storage device that created the snapshot and map it to your backup host.   Then log on to the backup host and discover the disks:

cfgmgr

Learn the name of the hdisk

lspv
lsdev -Cc disk

Then import the volume group.  You need to use -f to force an import with only half the VG members present (since you are importing a snapshot of one half of a mirrored pair).  In this example we have discovered hdisk1 and are using it to import the VG db2vg.

importvg -f -y db2vg hdisk1
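
Once the import completes, you would normally check and mount the file systems before starting the backup. A rough sketch, assuming the import created the usual /etc/filesystems stanzas and using the mount points from the earlier examples:

varyonvg db2vg      # usually already varied on by importvg
fsck -y /db2
fsck -y /db2log
fsck -y /db2data
mount /db2
mount /db2log
mount /db2data

The fsck (or the JFS2 log replay at mount time) is needed because the snapshot was taken while the file systems were mounted on the original host.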

Recreate the VG

If you are presenting the snapshot back to the same host that has the original VG, then there are two extra steps.   Because the snapshot has the same PVID as the staging disk, you need to clear the PVID and use the recreatevg command rather than the importvg command.

In this example I have two VGs and two disks.

[aix_lpar_4:root] / > lspv
hdisk0          00f771acc8dfb10a                    actvg           active  
hdisk2          00f771accdcbafa8                    rootvg          active

I map the snapshot I created and run cfgmgr.   If you are sharp-eyed you will spot that I don’t have any PVID clashes.   In fact I don’t even have the original DB2 VG on this host, but the method is still totally valid.

[aix_lpar_4:root] / > cfgmgr
[aix_lpar_4:root] / > lspv
hdisk0          00f771acc8dfb10a                    actvg           active  
hdisk2          00f771accdcbafa8                    rootvg          active  
hdisk4          00f771acd7988621                    None

We need to bring the VG online, so first we clear the PVID:

 [aix_lpar_4:root] / > chdev -l hdisk4 -a pv=clear
hdisk4 changed
[aix_lpar_4:root] / > lspv
hdisk0          00f771acc8dfb10a                    actvg           active  
hdisk2          00f771accdcbafa8                    rootvg          active  
hdisk4          none                                None

We now build a new VG using the VG name db2restorevg on hdisk4.

[aix_lpar_4:root] / > recreatevg -f -y db2restorevg hdisk4
db2restorevg
[aix_lpar_4:root] / > lsvg -l db2restorevg
db2restorevg:
LV NAME             TYPE       LPs     PPs     PVs  LV STATE      MOUNT POINT
fsjfsdb2log1        jfs2log    1       1       1    closed/syncd  N/A
fsjfsdb2log2        jfs2log    1       1       1    closed/syncd  N/A
fsjfsdb2log3        jfs2log    1       1       1    closed/syncd  N/A
fsdb2binlv          jfs2       14      14      1    closed/syncd  /fs/db2
fsdb2loglv          jfs2       14      14      1    closed/syncd  /fs/db2log
fsdb2datalv         jfs2       40      40      1    closed/syncd  /fs/db2data

Again, if you are sharp-eyed you will spot that in the output above every LV has fs added to its name.  In other words, db2binlv, which was mounted on /db2, is recreated as fsdb2binlv mounted on /fs/db2.   This is done because the recreatevg command assumes you are creating this VG on a host that already has this VG, so it renames constructs to prevent name clashes.   If for some reason you don’t want this renaming to occur, you can avoid it in the recreatevg command like this, where -L / and -Y NA force the command not to rename any labels.   Use this with care.

recreatevg -f -L / -Y NA -y db2restorevg hdisk4 

Backups without backup software or file system scans.

If the staging disk is presented by Actifio, then Actifio will track every changed block and will only need to read the changed blocks to create a new backup image of the snapshot.   The VG PP size will play a role in determining the quantity of changed blocks.  This effectively allows backups without backup software since the Actifio Dedup engine can read blocks straight from snapshots created by Actifio.   This is a very neat trick.    Also since we presented the staging disk from the Actifio snapshot pool, we now also have a copy that we can present at will for instant test and dev or analytics purposes.

Scripting for Application Consistency

When creating the snapshot, you ideally want the whole process to be orchestrated, with a regular update job run on a schedule.  The process should get the VG mirror back into sync, get the application into a consistent state (such as hot backup mode), create a snapshot and then let the VG mirror go stale again.

The Actifio Connector can be used to coordinate application consistency.  Clearly if your staging disk is coming from a different storage product then you will need to use that vendor’s method.  Every time Actifio starts a snapshot job (which can be automated by the Actifio SLA scheduling engine) it can call the Actifio Connector installed on the host to help orchestrate the snapshot.  It does so in phases:  init; freeze; thaw; fini and, if necessary, abort.   We set the database name, path and VG name at the start of the script.   The init phase re-syncs the VG; the freeze phase suspends DB2 writes (the equivalent of hot backup mode); the thaw phase resumes them; the fini phase splits the VG again; the abort phase resumes writes if anything goes wrong.

#!/bin/sh
# Called by the Actifio Connector with the phase name as the first argument
DBPATH=/home/db2inst2/sqllib/bin
DBNAME=demodb
VGNAME=db2vg

if [ "$1" = "freeze" ]; then
  # suspend DB2 writes so the snapshot is application consistent
  $DBPATH/db2 connect to $DBNAME
  $DBPATH/db2 set write suspend for database
  exit 0
fi
if [ "$1" = "thaw" ]; then
  # resume normal DB2 write activity once the snapshot has been taken
  $DBPATH/db2 connect to $DBNAME
  $DBPATH/db2 set write resume for database
  exit 0
fi
if [ "$1" = "init" ]; then
  # rejoin the split copy and wait until the mirror is fully synced
  sudo joinvg $VGNAME
  while true
  do
    synccheck=$(/act/scripts/checkvg.sh $VGNAME)
    if [ "$synccheck" != "Volume Group $VGNAME has 2 copies and is fully mirrored." ]
    then
      echo $synccheck
      sleep 30
    else
      break
    fi
  done
  exit 0
fi
if [ "$1" = "fini" ]; then
  # split the staging copy off again so the host no longer carries the mirroring load
  echo "Splitting $VGNAME"
  sudo splitvg -c2 $VGNAME
  exit 0
fi
if [ "$1" = "abort" ]; then
  # if the job aborts, make sure DB2 writes are resumed
  $DBPATH/db2 connect to $DBNAME
  $DBPATH/db2 set write resume for database
  exit 0
fi

Hopefully this whole process is helpful whether you use Actifio or not.  Here is a small set of references which helped me with this:

Waldemar Mark Duszyk Blog

Chris Gibson’s Blog

IBM Technote
