Six Validation Techniques to Improve Your Data Quality

Author: Wes Flores

Publish Date:February 19th, 2016

Copyright 2016 by TDWI, a division of 1105 Media, Inc. Reprinted by permission of TDWI.  Visit for more information

Have you ever had a set of reports that were distributed for years only to have your business users discover that the reports have been wrong all along and consequently lose trust in your data warehouse environment? Gaining trust is the foundation of user adoption and business value of your data management program.

Taking data quality seriously can be difficult if agility and speed-to-market are the name of the game for your business users. This is a lesson that is very costly to learn the hard way. How do you prevent these issues from occurring? After all, it's not like you didn't have validation checks as part of your standard process.

Full data-quality frameworks can be time-consuming and costly to establish. The costs are lower if you institute your data quality steps upfront in your original design process, but it is a valuable exercise to review and overhaul your data quality practices if you only have basic checks in place today.

Here are a few data validation techniques that may be missing in your environment.

Source system loop back verification: In this technique, you perform aggregate-based verifications of your subject areas and ensure it matches the originating data source. For example, if you are pulling information from a billing system, you can take total billing for a single day and ensure totals match on the data warehouse as well. Although this seems like an easy validation to perform, enterprises don't often use it. The technique can be one of the best ways to ensure completeness of data.

Ongoing source-to-source verification: What happens if a data integrity issue from your source system gets carried over to your data warehouse? Ultimately, you can expect that the data warehouse will share in the negative exposure of that issue, even if the data warehouse was not the cause. One way you can help catch these problems before they fester is to have an approximate verification across multiple source systems or compare similar information at different stages of your business life cycle. This can be a meaningful way to catch non data warehouse issues and protect the integrity of the information you send from the data warehouse.

Data-Issue tracking: By centrally tracking all of your issues in one place, you can find recurring issues, reveal riskier subject areas, and help ensure proper preventive measures have been applied. Making it easy for business users and IT to input issues and report on them is required for effective tracking.

Data certification: Performing up-front data validation before you add it to your data warehouse, including the use of data profiling tools, is a very important technique. It can add noticeable time to integrate new data sources into your data warehouse, but the long-term benefits of this step greatly enhance the value of the data warehouse and trust in your information.

Statistics collection: Ensuring you maintain statistics for the full life cycle of your data will arm you with meaningful information to create alarms for unexpected results. Whether you have an in-house-developed statistics collection process or you rely upon metadata captured with your transformation program, you need to ensure you can set alarms based upon trending. For example, if your loads are typically a specific size day in and day out and suddenly the volume shrinks in half, this should set off an alert and a full investigation should occur for such events. This situation may not trigger your typical checks, so it's a great way to find those difficult-to-catch situations on an automated basis.

Workflow management: Thinking properly about data quality while you design your data integration flows and overall workflows can allow for catching issues quickly and efficiently. Performing checks along the way gives you more advanced options to resolve the issue quickly. One example of this is to have strong stop and restart processes built into your workflow so that as an issue is found in the loading process, it can trigger a restart and determine if the issue was environment based. Enabling autocorrection of common challenges related to performance and environmental factors. It also can enable data quality checks that are only possible at the sub-task level during the processing window.

A Final Word

Although there may not be any truly foolproof way to prevent all data integrity issues, making data quality part of the DNA of your environment will go a long way in gaining the trust of your user community.