Is there gold in them there mountains of data? The average large enterprise has terabytes of data on hand: customer information, supplier exchanges, and internal company records. Within this mountain of data lie the golden nuggets that can help solve business problems and propel new strategic initiatives. By putting on a miner's hat, you can better analyze the data you already have on hand and improve your ability to increase revenues and reduce costs. Advances in both hardware and the capabilities of database management systems make data mining a more compelling proposition today. For example, the plummeting cost of disk storage has enabled enterprises to store more and more data. Likewise, microprocessors keep getting more powerful, while advances in symmetric multiprocessor technology have removed a lot of the overhead that once limited data mining.
Data mining is not a magic potion or a replacement for good business analysts. Data mining doesn't just hang out on servers watching data for interesting trends, paging a DBA with the results. An extension of traditional statistical analysis, data mining is a process in which an organization uses analytical tools to uncover hidden patterns and relationships in data. Those patterns can then be used to validate predictions made in the course of solving business problems.
Data mining has broad applicability across a large number of industries. Some enterprises use data mining to drive customer interaction. Tom Brady, president of The Destination Group Digital, uses data mining to identify and sell properties to customers who have stayed in vacation rentals in South Beach, Fla. "We filtered our prospect file down to 7,000 target leads using data mining. We then designed a newsletter to cater to these customers."
Digging for the Golden Nugget
As a process, the steps you take to successfully mine data should be viewed as a cycle rather than as a linear path. Several major steps are core to any data-mining strategy.
The first step, defining the business problem, sounds straightforward enough. However, making the best use of data-mining technology requires that the business problem be stated as precisely as possible.
For example, a business problem stated as "the need to increase sales in the east" will yield inferior results to one stated as "the need to determine how to increase order volume for a line of fishing products in the east." Likewise, asking, "How will offshoring company resources negatively affect the bottom line?" will net a different answer than asking, "How will offshoring company resources affect customer retention?"
David Lease, chief architect at WAM!NET Government Services, notes, "If the data-mining question is too broad, it won't work. The query needs to be narrowed and (you have to have) a specific goal in mind when asking (business) questions."
Constructing the data-mining database itself can take the bulk of the time in a data-mining process, depending on the condition and complexity of the data involved. First, you must determine the location of the data you'll need to construct the data-mining database. Is the data in one or more operational or transactional databases or already contained in a data warehouse?
Once you have identified appropriate sources, describe the data elements available from the sources you chose. You'll want to create a report that outlines the attributes of the data, for example, data type and range of possible values. Then, identify which subset of this data is needed to solve the business problem.
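The attribute report described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the sample rows, the field names, and the profile function are all hypothetical stand-ins for an extract from a real source system.

```python
# Hypothetical sample rows standing in for an extract from a source system.
rows = [
    {"customer_id": 1, "region": "east", "order_total": 120.50},
    {"customer_id": 2, "region": "east", "order_total": 75.00},
    {"customer_id": 3, "region": "west", "order_total": None},
]

def profile(rows):
    """Summarize each field: observed types, missing count, distinct count, range."""
    report = {}
    fields = sorted({name for row in rows for name in row})
    for name in fields:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        types = sorted({type(v).__name__ for v in present})
        entry = {
            "types": types,
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
        # A min/max range is only meaningful when every present value shares one type.
        if present and len(types) == 1:
            entry["min"] = min(present)
            entry["max"] = max(present)
        report[name] = entry
    return report

data_profile = profile(rows)
for name, entry in data_profile.items():
    print(name, entry)
```

A report like this makes it easier to decide which fields belong in the subset and which will need attention before mining.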
After subsetting the data, analysts will need to explore it for quality to determine what (if any) cleansing will be needed. Cleansing is essential for accurate data-mining results.
The cleansing process accounts for fields that might be missing data, fields that contain incorrect data, and fields with syntactical problems. You may not be able to resolve all issues with your data, but making an attempt to clean it well before mining will improve the chances for a successful outcome.
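As a rough sketch, the three kinds of problems named above (missing data, incorrect data, and syntactical problems) can each be expressed as a simple check. The records, field names, and validation rules here are illustrative assumptions; real rules depend on your data.

```python
import re

# Hypothetical records; the fields and the rules below are illustrative assumptions.
records = [
    {"name": "Ann Lee", "zip": "33139", "age": 42},
    {"name": "Bob Ray", "zip": "3313", "age": 37},    # syntactical problem: malformed ZIP
    {"name": "Cy Dunn", "zip": "33140", "age": -5},   # incorrect data: negative age
    {"name": "", "zip": "33141", "age": 29},          # missing data: empty name
]

ZIP_PATTERN = re.compile(r"\d{5}")

def find_issues(record):
    """Return a list of the quality problems found in one record."""
    issues = []
    if not record.get("name"):
        issues.append("missing name")
    if not ZIP_PATTERN.fullmatch(record.get("zip") or ""):
        issues.append("malformed zip")
    age = record.get("age")
    if age is None or not (0 <= age <= 120):
        issues.append("age out of range")
    return issues

# Keep records with no issues; set the rest aside with the reasons for review.
clean = [r for r in records if not find_issues(r)]
rejected = [(r, find_issues(r)) for r in records if find_issues(r)]

for record, issues in rejected:
    print(record, issues)
```

Setting rejected records aside with their reasons, rather than silently dropping them, leaves room to repair some of them later.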
Analysts next need to determine what metadata (if any) will be needed for mining and then define and execute a process to load the data-mining database. This process should be implemented as repeatable, rather than viewed as an ad-hoc or one-time event, because data changes rapidly.
Once the data-mining database has been constructed, the data must be explored in preparation for modeling. Analysts will need to use OLAP, data-mining exploration aids, and other tools to select variables and rows, and to create derivative variables. This initial data exploration helps determine the best type of model to use for data mining.
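Selecting rows and creating derivative variables might look something like the following sketch. The applicant fields, the debt-to-income ratio, and the five-year tenure cutoff are hypothetical choices made only for illustration.

```python
# Hypothetical applicant rows; field names and cutoffs are assumptions.
applicants = [
    {"income": 60000, "debt": 15000, "years_employed": 8},
    {"income": 40000, "debt": 30000, "years_employed": 1},
    {"income": 0, "debt": 5000, "years_employed": 0},
]

def add_derived(row):
    """Return a copy of the row with derivative variables added."""
    derived = dict(row)
    derived["debt_to_income"] = row["debt"] / row["income"]
    derived["long_tenure"] = row["years_employed"] >= 5
    return derived

# Row selection: skip rows where the ratio cannot be computed.
prepared = [add_derived(r) for r in applicants if r["income"] > 0]

for row in prepared:
    print(row)
```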
A Model That Fits
Several different types of models can be used to mine data. Initial data exploration may at first lead toward one type of model. However, exploration that applies different models to the business problem is warranted to find the one that will yield the most reliable results.
Once a data model has been constructed, it is crucial to verify that the best model has been selected for the project at hand. This will likely require a first pass of data mining with a small subset of data from the data-mining database. Examining error rates and the mining results will provide a good indicator of whether the model will solve the business problem accurately.
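One common way to examine error rates is to build the model on part of a sample and measure its mistakes on the rest. The sketch below assumes a deliberately simple model, a single cutoff on a debt-to-income ratio, and a hypothetical labeled sample. It stands in for whatever model type you actually choose.

```python
# Hypothetical labeled sample: (debt_to_income, defaulted).
sample = [
    (0.10, False), (0.20, False), (0.25, False), (0.30, False),
    (0.55, True), (0.60, True), (0.70, True), (0.80, True),
    (0.15, False), (0.65, True), (0.35, False), (0.75, True), (0.50, True),
]

# Build the model on the first eight rows; hold back the rest for checking.
train, holdout = sample[:8], sample[8:]

def predict(threshold, ratio):
    """Predict default when the ratio meets or exceeds the threshold."""
    return ratio >= threshold

def error_rate(threshold, data):
    """Fraction of rows where the prediction disagrees with the known outcome."""
    wrong = sum(1 for ratio, label in data if predict(threshold, ratio) != label)
    return wrong / len(data)

# "Fit" the model: pick the candidate cutoff with the lowest training error.
candidates = [ratio for ratio, _ in train]
best_threshold = min(candidates, key=lambda t: error_rate(t, train))

# Measure the error rate on data the model has not seen.
holdout_error = error_rate(best_threshold, holdout)
print(best_threshold, holdout_error)
```

A holdout error rate well above the training error suggests the model fits the sample rather than the underlying pattern, which is a cue to try a different model or revisit the data.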
Another helpful approach is to execute the model against a small subset of live data and compare that to the results from the data-mining database. This is particularly useful when some data elements (say, interest rates) may trigger a different data-mining outcome.
Once the model has been validated and executed, you'll want to view the results and identify actions to be taken, or use the model to add more business rules to existing data sets. This could take the form of a flag, which is set when a particular data set matches the model (credit worthiness, for instance). You'll also need to consider how to maintain your model over time given changes in business and data elements.
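Setting a flag from a validated model can be as simple as the following sketch. The score function, its 0.55 cutoff, and the credit_worthy field are hypothetical placeholders for a real model and schema.

```python
def score(record, threshold=0.55):
    """Hypothetical validated model: worthy when debt-to-income is below the cutoff."""
    return record["debt"] / record["income"] < threshold

# Hypothetical customer records from an existing data set.
customers = [
    {"id": 101, "income": 60000, "debt": 15000},
    {"id": 102, "income": 40000, "debt": 30000},
]

# Add the model's verdict to each record as a flag other systems can read.
for customer in customers:
    customer["credit_worthy"] = score(customer)

for customer in customers:
    print(customer["id"], customer["credit_worthy"])
```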
Are You Sure That's Mining?
Confusion exists over how data mining relates to data warehousing, data marts, and OLAP. David Smith, product manager at Insightful, explains, "OLAP is all about what happened in the past, as it just shows you (a view) of tables you already have. Only data mining (uses data) to help you predict the future."
Data mining complements technologies such as data warehouses and OLAP rather than replacing them. For example, users with a data warehouse have likely already performed data cleansing. Extracting a subset of that data to a data mart for mining is then a fairly simple task.
Many business analysts already use OLAP tools to examine data. If you use traditional query or reporting tools, you can see what your data contains. OLAP tools allow analysts to go further to gain an understanding of certain data pattern outcomes. Examining the income-versus-debt ratio to determine credit worthiness is an example of this capacity, but it requires that the analyst develop a theory and then use OLAP tools to query the data to validate or invalidate it.
By contrast, data mining does not rely on a hypothesis to uncover patterns in data. The data itself is used to identify patterns that may address a business problem. Using data mining to determine credit worthiness, for example, may link income and debt, but it may also identify years of employment as a contributing factor.
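To make the contrast concrete, the sketch below starts with no theory about which attribute matters. It tries a single cutoff on every field and ranks the fields by how well each separates good accounts from bad ones. The history rows are invented for illustration, and this one-cutoff search is a deliberately crude stand-in for what real data-mining tools do.

```python
# Hypothetical applicant history; "good" marks accounts that were repaid.
history = [
    {"income": 30, "debt": 10, "years_employed": 1, "good": False},
    {"income": 80, "debt": 20, "years_employed": 2, "good": False},
    {"income": 45, "debt": 30, "years_employed": 6, "good": True},
    {"income": 90, "debt": 15, "years_employed": 9, "good": True},
    {"income": 35, "debt": 25, "years_employed": 7, "good": True},
    {"income": 70, "debt": 12, "years_employed": 1, "good": False},
]

def best_split_error(rows, field, target="good"):
    """Lowest error rate achievable with one cutoff on the field, in either direction."""
    best = len(rows)
    for cutoff in sorted({r[field] for r in rows}):
        wrong = sum(1 for r in rows if (r[field] >= cutoff) != r[target])
        # Also consider the reversed rule (values below the cutoff predict "good").
        best = min(best, wrong, len(rows) - wrong)
    return best / len(rows)

# Let the data rank the attributes; no hypothesis is supplied up front.
fields = ["income", "debt", "years_employed"]
ranking = sorted(fields, key=lambda f: best_split_error(history, f))

for field in ranking:
    print(field, round(best_split_error(history, field), 3))
```

In this made-up sample, years of employment separates the outcomes better than income or debt, a relationship an analyst testing only an income-versus-debt theory would not have looked for.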
OLAP can be used to help theorize the effects of the data-mining outcome (say, of credit worthiness) on the corporate bottom line. Likewise, OLAP technologies can help analysts explore and better understand enterprise data prior to mining. In this regard, OLAP and data mining can work hand in hand.
Selecting Data Mining Tools
There is no shortage of solutions to enable a successful data-mining strategy (see kdnuggets.com for an exhaustive list of commercial tools). You can also find many equally effective open source solutions. Whichever tools you choose, it's key to implement data mining as an ongoing process.
As a core business and technology strategy, data mining can increase revenue and reduce costs, offering a competitive edge in good times or bad. -- InfoWorld (US)