Purging data saves money, cuts legal risk

Short-term costs lead to long-term benefits, users and analysts say

A funny thing happened on East Carolina University's journey to creating a data-retention strategy. As part of a compliance project launched one and a half years ago, Brent Zimmer, systems specialist at the university, was working with lawyers and archivists to determine which data was most important to keep and for how long. But it soon became clear that it was just as important to identify which data should be thrown away.

Zimmer was aware of the importance of being able to quickly produce required information during litigation, "but the thing we never thought about was keeping data too long", he says. The risk lies in keeping data that you wouldn't otherwise be required to produce but which, as long as it's discoverable, could be used as evidence against you.

Like many organisations, East Carolina had its share of data to purge. "We never made anyone throw away anything unless they ran out of space on their quota," Zimmer says. Some users, he says, had email dating back to 1996.

East Carolina is not unusual; many organisations hang on to more data than they need, for much longer than they should, says John Merryman, services director at GlassHouse Technologies, a storage services provider. One reason is fear, as compliance regulations require storage of certain data.

Another is the low cost of storage. Organisations have historically preferred to buy more disks rather than spend time and resources sorting through what they do and don't need. "Many people would prefer to throw technology at the problem than address it at a business level by making changes in policies and processes," says Kevin Beaver, founder of consultancy Principle Logic.

But thanks to e-discovery risk and burgeoning data volumes, the tide is starting to turn, according to Merryman. The average cost companies incur for electronic data discovery ranges from US$1 million (NZ$1.5 million) to US$3 million per terabyte of data, according to GlassHouse. While you need to pay attention to retaining data, at the same time, "all indications are that you need to be keeping less", Merryman says.

Aside from the costs, keeping all those records indefinitely is a gold mine for lawyers looking for evidence, he adds.

One way to address the problem of too much data is to set retention policies that reduce exposure to possible legal problems.

For instance, roll all data types, such as email, application and file data, into 10 to 30 categories of big-picture policies rather than hundreds of granular ones. "You need broader rules like 'Accounting data needs to be retained six years,' not 'This annual report needs to be retained [for] five years,'" Merryman says.

According to research from Enterprise Strategy Group, the average required retention period for files, emails and databases is on the rise. Most companies retain data for four to 10 years, says Brian Babineau, a senior analyst at ESG.

East Carolina University started with the low-hanging fruit, setting retention and purging policies for email, medical records and security video. It archived that data on a new system based on Symantec's Enterprise Vault storage management software and EMC's Centera content-addressed storage (CAS) array. Emails from the chancellor or dean are saved for seven years, Zimmer says, while faculty and staff email gets purged after three years.

Meanwhile, security video is archived for 30 days. Patient records from the medical school need to be kept for 20 years after the patient is deceased, so East Carolina uses EMC Rainfinity to take that data off primary storage and archive it to the Centera device so it's out of the backup environment.
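The retention periods Zimmer describes can be pictured as a small lookup table with a purge check. The sketch below is illustrative only and is not East Carolina's actual system: the category names and the `is_purgeable` function are hypothetical, years are approximated as 365 days, and the clock for patient records is assumed to start on the date of the patient's death.

```python
from datetime import date, timedelta
from typing import Optional

# Retention periods as reported in the article.
# Category names are hypothetical; years are approximated as 365 days.
RETENTION = {
    "email_chancellor_dean": timedelta(days=7 * 365),
    "email_faculty_staff": timedelta(days=3 * 365),
    "security_video": timedelta(days=30),
    "patient_record": timedelta(days=20 * 365),  # counted from the patient's death
}


def is_purgeable(category: str, clock_start: Optional[date], today: date) -> bool:
    """Return True once the retention period for `category` has elapsed.

    `clock_start` is the date the retention clock began (for example, the
    date an email was sent, or the date of a patient's death). None means
    the clock has not started, so the record must be kept.
    """
    if clock_start is None:
        return False
    return today - clock_start > RETENTION[category]


if __name__ == "__main__":
    today = date(2024, 2, 15)
    print(is_purgeable("security_video", date(2024, 1, 1), today))
    print(is_purgeable("email_faculty_staff", date(2022, 6, 1), today))
    print(is_purgeable("patient_record", None, today))
```

Keeping the table this short reflects the advice to use a few broad categories rather than hundreds of granular rules; an unknown category raises a `KeyError` rather than silently defaulting to "purge".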

Beyond that, the job will get more difficult, Zimmer acknowledges. "There's a lot of other stuff that we don't know the retention [requirements] for, so that will be more tricky," he says.

The key to reducing data volumes, Gartner says, is a process called "content valuation", which involves examining factors such as authorship authority, usage patterns, nature of content and business purpose. According to Gartner, there are many ways to approach content valuation, including electronic records management, content management, enterprise search to identify what's a record and what's not, legal preservation software and policy management.

Partly because of increased data retention activity, companies are increasingly implementing disk-based archiving tiers in their storage architectures. This is a better place to retain data than tape backup systems, Babineau says, because the data is indexed, searchable and stored in single-instance format, all of which makes it easier to find what you need during e-discovery.

According to Robert Stevenson, managing director of storage research at The InfoPro, archiving tiers have seen a 54% annual growth rate among users surveyed, versus 20% for Tier 1 monolithic storage and 40% for Tier 2 modular storage. Tier 1 tends to include high-performance storage platforms, with integrated capabilities for replication, disaster recovery and minimum downtime, he says. Tier 2 includes modular systems with lower cache and disk capabilities, lower cost per terabyte and an emphasis on ease of use, Stevenson adds.

And in the past three years, email archiving has grown, with 48% of survey respondents saying they use it today versus 39% two and a half years ago. Database archiving is also up, with 36% using it versus 21% two and a half years ago.

At East Carolina, Zimmer has reduced primary storage costs by 40%-50% by moving data to the Centera devices.

Another reason for archiving growth is that companies are relying less on backup tapes for retention and more on disk-based storage. "Discovery is a difficult task, and if you have multiple copies in the backup environment, it's extremely expensive to retrieve, index, search and take it through the preproduction process of culling and narrowing down results," Merryman says. "It can turn discovery into a multimillion-dollar project."

The seemingly simplest way to reduce data volumes is to delete the data you don't need. But this is much more easily said than done. The fact is, according to Merryman, outside of email, the status quo is to do nothing. "Most legacy applications have never purged data, and new applications are rarely designed to accommodate purging," he says.

Deleting production data is also complicated, he says. In addition, the issues associated with legal, compliance and operational risks are often ambiguous, and few organisations have a process to accommodate a web of requirements for data retention.

"If you look at legacy data outside the application world, a lot of people have no idea what it is, but they're scared of getting rid of it," he says. At one large bank in New York, Merryman says, he ran across hundreds of file extensions that no one knew about, as well as data inaccessible by currently maintained applications or interfaces.

The important thing is to start setting purging policies now rather than trying to apply them to old data. "If you address high-risk, high-volume applications and databases, you'll address 90% of the risk," he says. "If you target all 700 applications in your environment, you'll never get it done."

In fact, in a tiered storage environment, Merryman says, the business case is much better when data is purged rather than simply archived on lower-cost disk. "The cost of perpetually managing and refreshing huge amounts of data that's never been culled or purged is extremely high," he says.

Unfortunately, he says, most companies that develop tiering strategies figure they'll purge at some time in the future. "But that's the problem with purge," he says. "It's always 'later'."

Another difficulty with purging is the lack of a guarantee that you've deleted all instances of the data set. You might think you deleted all your old email, but it may be stored on tape from two years ago, so it still exists. "Some companies figure if you can't delete it consistently, don't delete it at all because it's probably somewhere that no one knows about," Babineau says.

Still, he says, "if you invest in technology that helps you retain data, why not invest in technology that helps expire data when you don't need it anymore?"

For instance, all archiving systems have a "delete" function, Merryman says, but no single product can purge data across all data types, such as messaging, unstructured and structured data. A fairly mature base of email archiving is available from the likes of Symantec, CA and EMC, as well as smaller companies such as Mimosa and Zantaz. File archiving systems vary widely, from EMC (Legato's hierarchical storage management product) to enterprise search vendors such as Kazeon and Abrevity. And in the database world, archiving vendors include OuterBay and PeopleSoft.

Merryman's advice: First identify vendors with proven technologies, and then look at emerging vendors. Second, he says, see if the vendors support or plan to support the Storage Networking Industry Association's emerging Archiving Standards. "This body of standards is young," he says, "but it's the only industrywide effort to standardise archiving methods."
