I helped edit a series that ran in VAR Business called Sam’s SAN Diary. It was written by Sam Blumenstyk, the technology operations manager at Schulte Roth & Zabel, a Manhattan law firm that’s typical of many midsize companies. Blumenstyk was in the midst of a major upgrade to his company’s storage infrastructure and built his first storage-area network (SAN). I asked him to keep a diary of the trials and tribulations, and he graciously agreed to share the experience. We produced a weekly serial column together that ran for dozens of episodes; most of them have since been deleted.
Episode #1: Introduction
I have been with SRZ for less than a year, originally tasked with evaluating our disaster-recovery needs, but now I am deep into my first SAN. Although I have designed numerous server rooms and enterprise networks and implemented a variety of storage solutions for prior employers and consulting clients, I had no SAN expertise when I started with the firm.
I supervise a basic offsite tape-rotation operation that covers backups of all of our critical data, with a staff of my own plus a tech analyst. My boss, the company’s IT director, had begun working on an offsite data-recovery plan before I came on board and had concluded that a replicated-data solution was needed. A tape-recovery solution would have taken longer to get us back online, and it would be risky as well: What happens if a critical tape isn’t readable when it’s needed? So began the path to my first SAN.
We are a Windows-only shop with about 700 desktops and 30 servers, with about 3 TB of storage capacity, half of it already filled. What makes our storage situation somewhat unusual is that our clients can bring in a terabyte of data (or more) on a single case. Law firms these days do a lot of electronic discovery, in which a legal team has to sift through millions of electronic documents, and all of that has to be stored someplace on my network. These huge storage needs keep growing.
IBM and SunGard, as well as smaller, regional firms we had been in contact with, had recommended a host-based replication product from NSI Software called Double-Take. It was relatively mature and worked by installing special drivers to track file input/output changes. So we wrote up a 10-page RFP for our disaster-recovery solution. We included specifics on storage-array controller-based replication solutions, along with a back-to-operation time of between four and 12 hours and specs for data-loss tolerance.
We distributed 20 RFPs, mainly to solution providers. About 15 firms responded, of which 10 actually addressed what we asked for. We followed up with about six of the most promising, asking them to come in and meet with us so we could better understand each proposed solution and its pricing. We were clear that this first set of meetings was to determine a general approach so that we could properly budget to implement the disaster-recovery solution during the next year.
A number of solution providers suggested a SAN as part of the solution, with proposals ranging from hundreds of thousands of dollars to millions. Some proposed that our direct-attached storage be used with host-based replication to a SAN at their facility. Options were presented both for purchasing a SAN for the disaster-recovery side and for connecting our servers to a multiclient SAN run by the disaster-recovery vendor. Other solution providers thought we should implement a SAN at both ends (meaning our offices and the offsite disaster-recovery location) to allow for SAN-based replication. That approach was more expensive but offered a much shorter back-to-operation time. SAN-based replication was a relatively common investment in Manhattan’s financial-trading industry, for both regulatory and business reasons.
I don’t want to give you the impression that our SAN needs were all driven by disaster-recovery planning. I mentioned our litigation storage requirements earlier. We also had a growing data challenge with Microsoft Exchange 2000 servers, which required more storage to support mail-retention applications and specialized archiving products designed around the legal industry. We were in the process of rebuilding several of these servers, redistributing the storage of our mail files, and realized that a SAN would save our IT staff a lot of time.
This was last October, and we realized this was not a project that would be completed quickly. In December, we considered implementing a tactical solution: a low-end SAN to provide some relief on our storage needs and to get our feet wet before we rolled out the full disaster-recovery solution. We used Manchester Equipment, a Hauppauge, N.Y.-based VAR we have been dealing with for some time, for the project.
Episode #5: Figuring out the specs
If we want the vendors’ pricing to get real, we need to identify how much disk is used on each LUN. Hopefully they will look at the application mix (Exchange 2000 and SQL Server included) and propose pricing based on some engineering considerations. Rumor has it that EMC ties its software-license pricing to the amount of disk being managed, not just the hardware platform. I wonder whether that’s true, and whether HP does the same. I guess I have more research to do at this point.
Pricing of the hardware components in a SAN is not just about the storage array. I have to provide a count of the servers that will access LUNs to determine the required switch ports. There are also the Fibre Channel-to-IP routers, the HBAs, the switches, the GBICs and the cables. A lot of this stuff wasn’t all that familiar to me, but I have spent the past few months boning up on what all these bits and pieces really do.
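To put rough numbers on this, here is a back-of-the-envelope port count in Python. Every figure is a hypothetical placeholder rather than our actual configuration, and it assumes the dual-fabric, dual-HBA design discussed next:

```python
# Rough SAN switch-port budgeting. All counts are hypothetical placeholders.
servers = 12                 # servers that will access LUNs
hbas_per_server = 2          # dual HBAs, one per fabric
fabrics = 2                  # dual switching fabrics double the switch count
array_ports_per_fabric = 2   # storage-array front-end ports, per fabric
isl_ports_per_fabric = 2     # ports reserved for inter-switch links, per fabric

server_ports = servers * hbas_per_server
array_ports = fabrics * array_ports_per_fabric
isl_ports = fabrics * isl_ports_per_fabric

total_ports = server_ports + array_ports + isl_ports
# The port total also drives the GBIC and cable counts.
print(f"Switch ports needed: {total_ports}")
```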
The firm we visited in late March that was using the EMC Symmetrix box ended up following “industry best practices” and going with dual switching fabrics and dual HBAs. That means twice the switches and twice the HBAs. And it did this not only for the SAN-attached servers at the production site, but for the disaster-recovery site as well.
I’m not sure whether a similar configuration would quite match our budget. But I’ll get pricing on both configurations and worry about what my recommendation will be in a few weeks.
Episode #7: Software demos
This week we had a chance to actually use the software on some demonstration lab systems with both HP and EMC. There were some similarities — both were Web-based Java applications. I’ll get to the differences below.
The EMC demonstration was done via VPN from our conference room to the company’s Berkeley Heights office, where it has three CLARiiONs that its engineers use for training. The presales engineer did some of the demo himself and let us drive for the rest. Navisphere is an impressive product! The interface is intuitive, the functionality is accessible from menus or right-clicks, and the tree view is easy to follow. There were a few concepts to grasp, such as the CLARiiON domain and storage groups, but it was not rocket science.
The EMC presenter, Chris Eng, was excellent. We got to see the main functionality from beginning to end: we created LUNs on the source and target sides, replicated data, simulated a broken link and mounted the target to the server. Because our team was distracted for part of the meeting (our mail server was being spammed), we did not see the local snapshot facilities. We will want to work with the product again to see that, but the CLARiiON’s ease of use is now a big plus on the software side for EMC.
The demo highlighted one issue we had heard about from our SunGard engineer, who will be acting as the systems integrator for our project. When we use SAN-based replication and have our LUNs offline at the target side (unlike host-based replication, where the servers are accessing the data at the target side), NT or Windows 2000 wants to do a CHKDSK once the LUN is presented. In the EMC demo with a 1 GB LUN, we said no to that prompt and were OK, but the SunGard engineer says that in a real disaster we will want to say yes. That’s additional time we have to factor into our RTO (Recovery Time Objective).
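To sketch why that CHKDSK prompt matters, it becomes one more term in the RTO arithmetic. The durations below are made-up illustrations, not measured figures:

```python
# Illustrative RTO math; every duration is hypothetical, in minutes.
declare_disaster = 60        # decision-making and failover initiation
present_luns = 15            # promote replicated LUNs at the target side
chkdsk_per_lun = 20          # the CHKDSK pass Windows wants per presented LUN
lun_count = 6
app_restart_and_tests = 90

rto_minutes = (declare_disaster + present_luns
               + lun_count * chkdsk_per_lun + app_restart_and_tests)
print(f"Estimated recovery time: {rto_minutes / 60:.1f} hours")
```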
The HP session was done at the vendor’s midtown offices in a large demo room, standing at a rack with a KVM switch and a couple of servers. (Not as comfortable, and harder to see.) While EMC’s demo browser pointed to a storage processor on a CLARiiON, HP’s pointed to a stripped-down DL380 called the Management Appliance. That server needed a reboot three minutes into the demo. Not a good sign. We did a local snapshot, but not the remote replication, because there was only one EVA in the room. (That demo, of Continuous Access, is set for a customer site next week.) The application was not end-user friendly, but the interface works for someone who approaches the tasks like an engineer. We did see two HP features that distinguish it from EMC: easy growth of a LUN thanks to virtualization, and capacity-free snapshots.
Currently, if you have multiple LUNs in a CLARiiON RAID group, you cannot allocate additional space in that RAID group to grow a LUN. Easier LUN growth is a feature EMC expects to improve on in future software releases; more cannot be said without violating confidentiality understandings. Capacity-free snapshots is HP’s lingo for: copy the inode table, write out blocks as they are changed, but do NOT preallocate space for those blocks. In the HP StorageWorks Business Copy EVA QuickSpecs data sheet, the feature is highlighted as “Snapshots” and “Vsnaps”; in the browser software, we found it “buried” behind the “Advanced” button in the Management Appliance’s Snapshot menu.
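For readers new to the idea, here is a minimal sketch of the copy-on-write mechanism behind capacity-free snapshots, in Python purely for illustration (real arrays work on disk blocks, not dictionaries):

```python
# Copy-on-write snapshot sketch: taking a snapshot allocates no data space;
# old block contents are saved out lazily, only as writes arrive.
class Volume:
    def __init__(self, blocks):
        self.blocks = blocks          # block number -> data
        self.preserved = None         # pre-write copies of changed blocks

    def take_snapshot(self):
        self.preserved = {}           # note: no space preallocated

    def write(self, block_no, data):
        if self.preserved is not None and block_no not in self.preserved:
            self.preserved[block_no] = self.blocks[block_no]  # save old data once
        self.blocks[block_no] = data

    def read_snapshot(self, block_no):
        # Snapshot view: the preserved block if it changed, else the live block.
        if self.preserved is not None and block_no in self.preserved:
            return self.preserved[block_no]
        return self.blocks[block_no]

vol = Volume({0: "old-a", 1: "old-b"})
vol.take_snapshot()
vol.write(0, "new-a")
print(vol.read_snapshot(0), vol.blocks[0])   # prints: old-a new-a
```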
Episode #33: Functional testing and remaining work
The last part of our disaster-recovery project was application testing. The replicated SQL databases needed scripts run to update records in our document-management system so that the records would point to the files’ new locations on the disaster-recovery servers. Also, the databases replicated at the dump level (rather than at the transaction level) needed to be restored.
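The actual scripts are specific to our document-management system, but the heart of the fix-up is a UNC path rewrite. Here is the idea in Python, with made-up server names and records:

```python
# Repoint document-management records at the disaster-recovery file servers.
# Server names, shares and records are hypothetical illustrations.
PROD_TO_DR = {
    r"\\PRODFS1": r"\\DRFS1",
    r"\\PRODFS2": r"\\DRFS2",
}

records = [
    {"doc_id": 1, "path": r"\\PRODFS1\docs\case42\brief.doc"},
    {"doc_id": 2, "path": r"\\PRODFS2\docs\case42\exhibit.tif"},
]

for rec in records:
    for prod, dr in PROD_TO_DR.items():
        if rec["path"].upper().startswith(prod.upper()):
            # Swap the server prefix, keep the rest of the path intact.
            rec["path"] = dr + rec["path"][len(prod):]
            break

for rec in records:
    print(rec["doc_id"], rec["path"])
```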
We also had to modify our login script to account for a new Home directory, which is set both at the Active Directory level and in the login script. Testing the part of the login script that drops the production Home directory link and sets up the disaster-recovery link required us to ensure the communication link was down, as it would be in a real disaster.
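Our actual login script is not Python, but the branch logic looks roughly like this sketch; the server names, shares and reachability check are all hypothetical:

```python
# Sketch of the login-script branch: map the production Home directory when
# the production file server answers, otherwise fall back to the DR copy.
# Server and share names are hypothetical.
import socket

PROD_HOME = r"\\PRODFS1\home"
DR_HOME = r"\\DRFS1\home"

def server_reachable(host, port=445, timeout=2.0):
    """Crude reachability test against the Windows file-sharing port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

home = PROD_HOME if server_reachable("PRODFS1") else DR_HOME
# The real script would run the equivalent "net use H: ..." mapping command.
print(f"map H: to {home}")
```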
Access to the disaster-recovery Citrix MetaFrame XPe server had to be enabled through the SunGard firewall. All of the applications tested out just fine. Our SAN and disaster-recovery project concept is a success. The major technical challenges have been met and the concepts have been proved. The HP ProLiant servers, Emulex HBAs, Brocade switches, CNT routers and the EMC CLARiiON have worked as promised. And the EMC professional services team has been outstanding. This first test represents an important milestone; however, one is never truly done with a business-continuity project.
During these past 12 months, we have published more than 30 diary entries, and the time has come to wrap up this “Weblog.” First, let me tell you what we have been doing since March and what we have on our plate for the rest of the year. The items fall into five main areas:
1. We have additional systems to migrate to the SAN; for these we must replicate the data, install servers at the disaster-recovery site to present it, and add the applications to the Citrix remote desktop. Some will be done with the storage-based replication CLARiiON solution, and others will likely be done with Windows Storage Server 2003 host-based replication of data on the MSA platform.
2. There are scenarios short of a full-scale disaster that we have identified but not planned for. These must be evaluated and discussed with management. An important part of ongoing technology disaster-recovery work is constantly reviewing business needs, tracking evolving regulatory requirements and repeating tests.
3. There is related data-management work we still need to attend to. The new SAN architecture and replication site give us an excellent platform to improve our data-protection and backup strategies for situations short of disasters. That includes EMC’s snapshots, better use of the HP tape library we purchased, and possibly some disk-to-disk-to-tape (DtDtT) backup solutions, either locally or at the disaster-recovery site.
We may also take a close look at the eVault product, which will require its own storage. Eventually, we have to develop a full data life-cycle management strategy. Just because SAN technology makes it possible to keep loads of data does not mean that business needs call for its retention.
4. Exchange recovery is still not fully designed. Microsoft Consulting Services (Carl Solazzo and Jenn Goth) led us in an excellent two-day architecture discussion, with substantial EMC input on the first day regarding the data component. The conclusion on both EMC’s and Microsoft’s part is that SRZ’s needs (Exchange 2000 replication over IP) require moving to the Windows 2003 and Exchange 2003 platform for Volume Shadow Copy Service support, augmented with appropriate, updated EMC products.
Until those products are available, we will be exploring host-based replication for Exchange. This is a strategy we initially rejected (see the discussion of Double-Take in Week 2). However, we owe the firm a better Exchange solution now, so it is back on the table. Since the project began, EMC has purchased Legato, and it is offering the most favorable terms imaginable if we use the comparable Legato product, RepliStor.
5. SunGard has been working with us to implement its monitoring services. That is taking longer than either of us wanted, in part because SunGard reorganized just as we were going live. I am optimistic that it will be a good partner, once we work through the remaining monitoring issues.
Final Episode #34: Lessons learned for VARs
1. The customer is likely to be a novice. The technology is very new. When I began networking in the ’80s with IBM’s coax network and later Pronet-10, it was hard to accept that packets from all the computers were traveling over the same wire and that all the PCs could thus communicate. Similarly, customers will need to learn that disk I/O amounts to SCSI commands, and that a disk array with proper buffering and caching can deliver performance that is actually faster than internal drives. The terms can be confusing, too: Fibre is a protocol, while fiber is a medium. Add in host initiators, WWN masking and options to run SANs over IP. VARs must accept that first-time purchasers will require a long lead cycle.
2. Customers are trying to manage data, and the wealth of technology options is sometimes overwhelming. We want to snapshot, clone, leverage SAN technology for better backups, and so on. The market provides options to implement these functions at various layers of the overall infrastructure: NAS devices or gateways, SAN controller-based software, switch software, host software and in-band appliances. You need to be able to discuss all of these and support as many as possible of those that are mature enough and the right fit for the customer’s needs.
3. Don’t assume SAN technology is more mature than it really is. Verify components and test solutions. Invest in a lab. We worked with major vendors and still had serious disconnects about IP replication for our Exchange environment, which complicated and lengthened the overall project. Vendor interoperability is limited, and because SAN technology is new to many customers, they are unwilling to take risks. Today, my HP SAN and my EMC SAN are separate islands. Granted, Brocade is marketing a router that both vendors will eventually support, but that seems like overkill for interoperability. If you are a good VAR, work with the array vendors and the HBA vendors as necessary (in my case, Emulex) to provide your customer with a server HBA driver that can see both arrays and be supported by the array manufacturers.
4. The channel is evolving as the business grows. My first HP SAN purchase was from my VAR. Since then, HP has developed a robust B-to-B offering for an enterprise of my size (fewer than 1,000 employees), and my next MSA purchase will likely be through that channel. The VAR still gets some compensation, in the form of an agent fee. My first EMC purchase was direct, through a vertical group focused on the New York law-firm industry. My next EMC purchase (ATA drives for the current CX 600) will likely be through an EMC reseller (MTI), which was a major Legato dealer. (This same reseller will be helping me with the Legato RepliStor midterm solution for Exchange replication.)
5. Provide 24/7 integration service; we will pay for it. Having my mission-critical systems on EMC storage has been comforting, not only because of the excellent project team, but also because EMC has a robust, mature support service, SAC. The HP investments have been more difficult to manage. My local reseller’s staff mostly keeps banker’s hours, not IT hours. Even with a prepaid hourly service commitment, they have sent me to HP’s direct support for many of my MSA questions, an effective but time-consuming alternative. When I had a hardware problem with the MSL 5052, the VAR would not get engaged; I got proper attention from the local HP service team only after escalating to management and screaming into the phone. Do not leave your customers alone; be the partner you claim you want to be.