
Data Quality = Iterative Refinement

An outsider may view data quality as a single event: run the data through the software. In fact, it's a multi-step process. Why? A key reason is that many quality issues do not have a single "correct" solution. For example, should the names of cities outside of the US be in English or in the local language? Some companies will choose the former; others the latter. Should "B. Ross" and "Betsy Ross" at the same address be considered a duplicate? Probably so for a software company; maybe not for a financial services firm. To account for these issues, data quality products allow you to adjust cleansing settings and specify custom transformations.
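As a rough illustration of why the "B. Ross" question has no single answer, the sketch below treats "an initial may stand in for a full first name" as an adjustable setting. The function names and the `allow_initials` flag are hypothetical and are not taken from any product.

```python
def normalize(name):
    # Lowercase and drop periods so "B." and "b" compare equal.
    return name.lower().replace(".", "").split()


def same_person(a, b, allow_initials):
    """Return True if two names at the same address count as duplicates.

    allow_initials is a hypothetical policy switch: when True, a single
    letter may match any name part that starts with that letter.
    """
    parts_a, parts_b = normalize(a), normalize(b)
    if len(parts_a) != len(parts_b):
        return False
    for x, y in zip(parts_a, parts_b):
        if x == y:
            continue
        is_initial = (len(x) == 1 and y.startswith(x)) or (
            len(y) == 1 and x.startswith(y)
        )
        if not (allow_initials and is_initial):
            return False
    return True


# A software company might merge these two records ...
print(same_person("B. Ross", "Betsy Ross", allow_initials=True))
# ... while a financial services firm might keep them apart.
print(same_person("B. Ross", "Betsy Ross", allow_initials=False))
```

The same pair of records is a duplicate under one policy and not under the other, which is why the setting has to be a choice rather than a fixed rule.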

Real-world data involves tradeoffs. For cleansing: changes made to fix one problem may introduce errors elsewhere -- the familiar "two steps forward, one step back." For match/dedup: the goal is to adjust settings far enough to maximize the number of desired matches but not so far as to match too many unrelated records. Finding this balance requires a process of iterative refinement: run a representative set of data, evaluate the results, refine the settings, and run again.
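The match/dedup tradeoff can be sketched with a single similarity threshold. This is a minimal example that assumes a plain string-similarity score from Python's standard `difflib` module; real matching engines use more elaborate rules, and the sample names and threshold values are made up for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical sample; a real run would use a representative data set.
records = ["Betsy Ross", "Betsey Ross", "B. Ross", "Bob Rossi"]


def match_pairs(names, threshold):
    """Return every pair of names whose similarity meets the threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            a, b = names[i].lower(), names[j].lower()
            score = SequenceMatcher(None, a, b).ratio()
            if score >= threshold:
                pairs.append((names[i], names[j]))
    return pairs


# Run, evaluate, refine the setting, and run again.
for threshold in (0.95, 0.85, 0.60):
    found = match_pairs(records, threshold)
    print(f"threshold={threshold:.2f}: {len(found)} pair(s) {found}")
```

Lowering the threshold can only add pairs. The analyst's job in each cycle is to judge whether the newly added pairs are desired matches or unrelated records.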

 

Their Process

Existing data quality products are point solutions, focused on cleansing and match/dedup. They do not provide effective tools to assist with the process of iterative refinement.

[Diagram: their data quality process]

Some data quality vendors claim to provide support for profiling -- but everything we've seen still requires the data analyst to do most of the work, especially when compared with our use of domain knowledge. (Do you agree or disagree with our analysis? Please send us your experience.)

 

Our Process

Whether used with or instead of existing products, DQ Now fills the above gaps with interactive tools that dramatically reduce both the time spent per cycle and the total number of cycles.

[Diagram: our data quality process]

(Many of the DQ Now features described here are still in beta test. Current features are listed on the products page.)

(To keep the diagrams manageable, we've included custom transformations with cleansing.)

Steps:

  • Profile raw data to establish a baseline
  • Cleanse the data, review changes, adjust cleansing settings ... repeat
  • Match & dedup cleansed data, review match groups, adjust match/dedup settings ... repeat
  • Cleanse the data, match & dedup, profile SUSPECT data, adjust cleansing and/or match/dedup settings ... repeat
  • Cleanse the data, match & dedup, profile CLEANSED data, adjust cleansing and/or match/dedup settings ... repeat
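The profile / cleanse / profile-again cycle in the steps above can be sketched with a generic pattern-frequency profiler. This illustrates a common profiling technique, not DQ Now's profiler; the `shape`, `profile` and `cleanse` names, the phone numbers, and the hyphenation rule are all assumptions made for the example.

```python
import re
from collections import Counter


def shape(value):
    # Reduce a value to its pattern: letters become "A", digits become "9".
    letters_masked = re.sub(r"[A-Za-z]", "A", value)
    return re.sub(r"[0-9]", "9", letters_masked)


def profile(values):
    """Count how often each pattern occurs in a column."""
    return Counter(shape(v) for v in values)


def cleanse(value):
    # Hypothetical cleansing rule: hyphenate bare 7-digit numbers.
    return re.sub(r"^(\d{3})(\d{4})$", r"\1-\2", value)


phones = ["555-1234", "555-9876", "5551234", "555-12E4"]

baseline = profile(phones)                       # profile raw data
cleansed = [cleanse(p) for p in phones]          # cleanse
after = profile(cleansed)                        # profile cleansed data

print("baseline:", dict(baseline))
print("after:   ", dict(after))
```

Comparing the two profiles shows which patterns a rule fixed and which exceptions remain for the next cycle.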

As another view of the same issue, here's a side-by-side comparison that takes a simplified path through the process.

The Data Quality Process -- A Comparison

| # | Task | Problems with their approach | Our solution |
|---|------|------------------------------|--------------|
| 1 | Profile the data to identify problems. | Hard to find all the problems or figure out which ones require attention. | Domain knowledge omits problems the engine will fix and organizes the remainder for easy understanding. |
| 2 | Adjust cleansing settings and add custom transformations. | Awkward, non-standard syntax for custom changes. | Regex provides a powerful way to specify custom changes. |
| 3 | Cleanse. | | |
| 4 | Review changes. | Spend much too long poring over reports that look like those greenbar printouts from the '70s. | Every change to every field is shown, with an overview and detail view. |
| 5 | Match & Dedup. | | |
| 6 | Review match groups. | Painstaking effort to understand why records matched and to find items that don't meet company requirements. | Match levels are color-coded and character-level differences highlighted. |
| 7 | Review cleansed data to identify remaining problems. | Hunt around for problems the engine missed. | See step 1: we profile the latest "cleansed" data to find additional exceptions that surfaced due to other changes. |
|   | Repeat as needed. | Go back to step 2. ("No, not again!") | Go back to step 1: our profiler is an integral part of the process. |
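Tasks 2, 4 and 6 in the comparison mention regex-based custom changes, a record of every change, and character-level difference highlighting. The sketch below shows one way those ideas could fit together using only Python's standard library. The rules, the `{-...-}` / `{+...+}` markers, and all function names are assumptions for illustration, not DQ Now's syntax or output.

```python
import re
from difflib import SequenceMatcher

# Hypothetical custom transformations, each a (pattern, replacement) pair.
RULES = [
    (re.compile(r"\bSt\.?(?=\s|$)"), "Street"),  # expand "St" / "St."
    (re.compile(r"\s{2,}"), " "),                # collapse repeated spaces
]


def cleanse(value):
    for pattern, replacement in RULES:
        value = pattern.sub(replacement, value)
    return value


def highlight(before, after):
    """Mark removed text as {-...-} and added text as {+...+}."""
    parts = []
    matcher = SequenceMatcher(None, before, after)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.append(before[i1:i2])
            continue
        if i2 > i1:
            parts.append("{-" + before[i1:i2] + "-}")
        if j2 > j1:
            parts.append("{+" + after[j1:j2] + "+}")
    return "".join(parts)


def review(values):
    """Return (before, after, highlighted) for every value that changed."""
    changes = []
    for before in values:
        after = cleanse(before)
        if after != before:
            changes.append((before, after, highlight(before, after)))
    return changes


addresses = ["12  Main St.", "4 Stone Rd", "9 Elm St"]
for before, after, marked in review(addresses):
    print(f"{before!r} -> {after!r}   {marked}")
```

Listing only the changed values, with the differences marked inline, keeps the review step short. The same listing also exposes rules that fire where they should not, such as a "St" that means "Saint".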

For DQ Now, cleansing is a small part of the whole process. We focus on providing useful information in convenient form every step of the way.


Just for completeness, it's worth noting that the above process is usually embedded in a larger one.

  1. Extract data from source.
  2. Import into data quality software.
  3. Profile, cleanse, match and review. (described above)
  4. Export out of data quality software.
  5. Load into destination.
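The outer process can be sketched as a small pipeline. This is a toy example that assumes CSV text as both source and destination and uses in-memory strings in place of real systems. The function names are hypothetical, and the middle step is a stand-in for the whole profile, cleanse, match and review cycle.

```python
import csv
import io

# Hypothetical source extract (step 1).
SOURCE = "name,city\n betsy ross ,philadelphia\nB. Ross,Philadelphia\n"


def import_rows(text):
    # Step 2: read the extract into a list of dicts.
    return list(csv.DictReader(io.StringIO(text)))


def cleanse_rows(rows):
    # Step 3 (placeholder): trim whitespace and normalize capitalization.
    return [{k: v.strip().title() for k, v in row.items()} for row in rows]


def export_rows(rows, fieldnames):
    # Steps 4 and 5: write the rows back out as CSV for the destination.
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


result = export_rows(cleanse_rows(import_rows(SOURCE)), ["name", "city"])
print(result)
```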

Next step: see how we use domain knowledge to save you time.


DQ Now, DQ Now AUDIT, DQ Now ETL Audit, DQ Now Cleansing Audit, and the DQ Now logo are trademarks of DQ Now.
Copyright 2002 by DQ Now. ALL RIGHTS RESERVED.
Send questions or comments to email2006@DQNow.com.
Products, services, and prices are subject to change without notice.