Get the dirt on your data for squeaky clean info

Incorrect and wrongly classified data is a big, expensive problem. Catherine LaCroix talks to some IT pros about how to rinse things clean.

Stampin’ Up is a Utah-based direct sales company that manufactures and distributes rubber stamp supplies. In the summer of 2004, at the end of its highest sales season, its manufacturing resource planning system stopped issuing orders to make more stamps because the system showed the items as already in stock. But that inventory didn’t exist. The culprit: dirty data.

“It took a few days to figure it out,” says Steve Gockley, the company’s manager of web infrastructure, websites, and business intelligence and analytics. “Once it was found, we had to cut work orders and do some emergency manufacturing.”

Lucky for Stampin’ Up, the backlogged products weren’t available from competitors. Otherwise, customers might have gone elsewhere. But employee morale suffered. “It takes a while to build confidence back up,” says Gockley.

Dirty data is data that is incorrect, missing or misplaced. And it’s everywhere. In a 2006 poll of 1,160 knowledge workers by US-based researcher Harris Interactive, 75% of respondents reported having made critical business decisions based on faulty data. “In any company of any size, dirty data is a factor,” says Gockley.

That’s because data is dynamic by nature. Manually entering data, integrating systems or repurposing data, or something as simple as a customer moving, dying or marrying, can mess things up. The trick is to find errors and fix them.

“A lot of determining whether data is dirty is about looking at trends,” says Susan Nonken, financial reporting systems manager at ABB, a provider of power and automation products. If you’re looking at data that seems too good, too bad or just too strange to be true, it probably is.

Also look for errors in data drawn from multiple sources and formats.

Move.com aggregates listings from real estate agents and brokers. “We have to get the data whatever way we can,” says Bill Weir, director of business systems. The result: a high possibility of errors.

Once you know you’ve got faulty data you’ve got to fix it, and the hurdles may be more than technical. People may be reluctant to relinquish control of their data, even to render it more useful. And customers can get impatient. “They want to know why you can’t fix it right away,” says Bita Mathews, data warehousing manager at Move.

Getting executive buy-in can help, as can educating users about the processes and complexities of data warehousing. “We have learned that the frequency of a message is as important as the message itself,” says Mathews.

Another challenge, says Weir, is keeping data clean once you scrub it. That requires some careful decisions about data-quality governance, enforcement and maintenance. The commitment isn’t just to the clean-up, says Weir, but to “how data quality will be enforced, and what the integration and architectural guidelines should be for data quality standards.”

Be prepared for an ongoing process, says Mathews. For example, because Move often finds multiple listings of the same property, she has had to make de-duping (deleting duplicate records) part of her routine.
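
To make the idea concrete, here is a minimal Python sketch of de-duping listings. It is not Move's actual process: the field names, the sample listings and the rule that two listings are duplicates when their normalised addresses match are all assumptions made for illustration.

```python
import re

def normalize_address(address: str) -> str:
    """Lower-case, drop punctuation and collapse whitespace so that
    near-identical addresses compare equal (a deliberately simple rule)."""
    cleaned = re.sub(r"[^\w\s]", "", address.lower())
    return " ".join(cleaned.split())

def dedupe_listings(listings):
    """Keep the first listing seen for each normalised address."""
    seen = set()
    unique = []
    for listing in listings:
        key = normalize_address(listing["address"])
        if key not in seen:
            seen.add(key)
            unique.append(listing)
    return unique

# Hypothetical sample data: listings 1 and 2 describe the same property.
listings = [
    {"id": 1, "address": "12 Oak St., Springfield"},
    {"id": 2, "address": "12 oak st  springfield"},
    {"id": 3, "address": "98 Elm Ave, Springfield"},
]
unique_listings = dedupe_listings(listings)
print([listing["id"] for listing in unique_listings])
```

Real listing feeds would likely need fuzzier matching (abbreviations, unit numbers, misspellings), which is one reason de-duping tends to be a routine rather than a one-off job.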

“It’s not a one-step process,” says Robert Lerner, an analyst at market research firm Heavy Reading.

John Leslie, chief technology officer at Wall Street On Demand, which hosts financial research websites, says that establishing a system of data checks is the key to keeping the company’s data clean. “We’ve built our system so that if a vendor sends us an XML schema that says this field is a date, we will parse it out and validate that it is,” he says.

“Adding in those kinds of checks is very important,” he says. “You can’t just tack quality on at the end; you have to build it throughout the process.”
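
A check of the kind Leslie describes might look something like the following Python sketch. It is not Wall Street On Demand's code: the element name `asOf`, the sample XML and the assumption that dates arrive in ISO `YYYY-MM-DD` form are all hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date

def validate_date_field(record_xml: str, field: str):
    """Return (parsed_date, None) if the field holds a valid ISO date,
    otherwise (None, reason)."""
    element = ET.fromstring(record_xml).find(field)
    if element is None or not (element.text or "").strip():
        return None, f"missing field: {field}"
    text = element.text.strip()
    try:
        return date.fromisoformat(text), None
    except ValueError:
        return None, f"not a valid date: {text!r}"

# Hypothetical vendor records: the second claims a date that cannot exist.
good = validate_date_field("<quote><asOf>2007-03-15</asOf></quote>", "asOf")
bad = validate_date_field("<quote><asOf>2007-02-30</asOf></quote>", "asOf")
print(good)
print(bad)
```

Running the check as each record arrives, rather than after loading, is one way to build quality in throughout the process instead of tacking it on at the end.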

Any way you slice it, it’s going to take time and money to get your data clean. “We spent a year on this issue,” says Gockley. And Weir says Move has spent US$300,000 (NZ$468,000) to US$500,000 in time and technology tools so far.

But it’s worth it. Lerner says return on investment can be calculated by figuring out the costs saved by avoiding the erroneous results and the repair work associated with inaccurate data. But the real payback is having an accurate picture of your organisation and a better understanding of your customers, he says.

Accurate information on sales calls is key for Lee Alaniz, director of sales operations at Realtor.com, a subsidiary of Move. “We now have a higher confidence that we’re not contacting someone about property that was sold two or three weeks ago,” Alaniz says.

There are also less-tangible benefits. “For us, what it really gets into is reputation,” says Leslie. “We’re in the business of helping the individual investor make an informed decision. If the end user loses confidence in the data, it’s our reputation on the line.”

Stampin’ out dirty data

When dirty data caused the manufacturing system at Stampin’ Up to sabotage inventory by sending out false signals, Steve Gockley, manager of business intelligence, had to get his hands dirty to fix the problem.

First, he says, he did some “ad hoc stuff” to get two dissimilar systems in sync. He used SQL code to manually update the data between them. At the end of each data integration, the SQL code validated the data and repaired any faulty or changed data.
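
The article doesn’t show Gockley’s SQL, but a validate-and-repair pass between two systems could be sketched roughly as below, using Python’s built-in sqlite3 module. The table names, column names and sample quantities are hypothetical, and treating the warehouse figures as the authoritative ones is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE planning_stock  (sku TEXT PRIMARY KEY, on_hand INTEGER);
    CREATE TABLE warehouse_stock (sku TEXT PRIMARY KEY, on_hand INTEGER);
    INSERT INTO planning_stock  VALUES ('STAMP-A', 500), ('STAMP-B', 120);
    INSERT INTO warehouse_stock VALUES ('STAMP-A', 0),   ('STAMP-B', 120);
""")

# Validate: list the SKUs where the two systems disagree.
mismatches = conn.execute("""
    SELECT p.sku, p.on_hand, w.on_hand
    FROM planning_stock AS p
    JOIN warehouse_stock AS w ON w.sku = p.sku
    WHERE p.on_hand <> w.on_hand
    ORDER BY p.sku
""").fetchall()

# Repair: overwrite the planning figure with the warehouse figure
# (assumed here to be the source of truth).
conn.execute("""
    UPDATE planning_stock
    SET on_hand = (SELECT w.on_hand FROM warehouse_stock AS w
                   WHERE w.sku = planning_stock.sku)
    WHERE EXISTS (SELECT 1 FROM warehouse_stock AS w
                  WHERE w.sku = planning_stock.sku
                    AND w.on_hand <> planning_stock.on_hand)
""")
conn.commit()

repaired = dict(conn.execute("SELECT sku, on_hand FROM planning_stock"))
print(mismatches)
print(repaired)
```

In this sample, the planning table believes 500 units of one item are in stock while the warehouse holds none, which mirrors the phantom inventory described in the main story.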

After that initial fix, which was temporary, Gockley had to identify the source of the bad data and then redesign the automatic integration between the dissimilar systems.

The source of the problem was that the two systems — one providing manufacturing resource planning and purchasing information, the other providing warehouse management and receiving information — were from different vendors and didn’t talk to each other properly.

Gockley resolved that problem by implementing data-quality tools and defining the internal business rules needed to keep the purchasing and warehouse management systems working in tandem.

“As we grab data from each system and start cleaning it, we apply business rules that we know to be true,” says Gockley. “At that point, we can tell if we have dirty data and where the dirty data is from.”
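
As a rough illustration of that approach, the Python sketch below applies a couple of business rules to records tagged with their source system, so that each violation can be traced back to where it came from. The rules, field names and sample records are hypothetical and are not Stampin’ Up’s actual rules.

```python
from collections import Counter

# Hypothetical rules: each check returns True when a record is clean.
RULES = {
    "quantity_not_negative": lambda r: r["quantity"] >= 0,
    "sku_present": lambda r: bool(r["sku"].strip()),
}

def find_violations(records):
    """Return (source, sku, rule_name) for every rule a record breaks."""
    return [
        (r["source"], r["sku"], name)
        for r in records
        for name, check in RULES.items()
        if not check(r)
    ]

# Hypothetical records pulled from the two systems.
records = [
    {"source": "planning",  "sku": "STAMP-A", "quantity": 500},
    {"source": "warehouse", "sku": "STAMP-B", "quantity": -4},
    {"source": "warehouse", "sku": "  ",      "quantity": 10},
]
violations = find_violations(records)
by_source = Counter(source for source, _, _ in violations)
print(violations)
print(by_source)
```

Counting violations per source is what lets a team say not only that data is dirty but which system it is coming from.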
