Domain Knowledge
What it is and why it matters
do·main: n. - 2. A sphere of activity, concern, or function...
knowl·edge: n. - 2. Familiarity, awareness, or understanding gained through experience or study.
(The American Heritage® Dictionary of the English Language, Fourth Edition, as quoted by Dictionary.com)
For data quality, "domain knowledge" refers to information about each field (e.g. first name, country) and what makes it valid or invalid: characters (e.g. letters vs. digits), character pattern, range or list of values (e.g. all states), relationship with other fields (e.g. US states vs. Canadian provinces), and more.
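The kinds of rules listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the truncated state and province lists, and the patterns are hypothetical, and are not taken from DQ Now.

```python
import re

# Hypothetical, deliberately truncated value lists for the sketch.
US_STATES = {"CA", "NY", "TX"}
CA_PROVINCES = {"ON", "QC", "BC"}


def valid_first_name(value: str) -> bool:
    """Characters: letters only, with single spaces, hyphens or apostrophes between parts."""
    return bool(re.fullmatch(r"[^\W\d_]+(?:[ '\-][^\W\d_]+)*", value))


def valid_us_zip(value: str) -> bool:
    """Character pattern: five digits, optionally followed by a hyphen and four digits."""
    return bool(re.fullmatch(r"\d{5}(-\d{4})?", value))


def valid_region(region: str, country: str) -> bool:
    """Relationship with another field: the allowed region list depends on the country."""
    allowed = {"US": US_STATES, "CA": CA_PROVINCES}.get(country)
    return allowed is not None and region in allowed
```

Each function encodes one kind of domain knowledge: allowed characters, a character pattern, and a list of values that depends on a related field.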
Every data cleansing engine uses some form of domain knowledge; DQ Now is the only product that also uses this information at the data profiling stage. Other profiling tools tell you what data a field contains; we tell you what it means.
Let's consider an example. One of the fundamental tools of data profiling is a frequency distribution. Here's an analysis of a real data set. Note that it is relatively clean: there are no misfielded values (e.g. a zip code or city that ended up in the country field) and no typos. Each value is plausible (e.g. "Great Britain"), and might pass a quick human inspection. The data is shown two ways: sorted by frequency and by name. Now, how long does it take for a data analyst (or an end user acting in that capacity) to identify all the data quality problems, and then determine the best way to address them?
Their Generic Profile

Total records: 541
Distinct values: 54

| Name (sorted by frequency) | Count | Percent | Name (sorted by name) | Count | Percent |
|---|---|---|---|---|---|
| United Kingdom | 89 | 16.5% | ARGENTINA | 1 | 0.2% |
| Germany | 86 | 15.9% | Argentina | 1 | 0.2% |
| Japan | 64 | 11.8% | Australia | 36 | 6.7% |
| Australia | 36 | 6.7% | Austria | 15 | 2.8% |
| Switzerland | 31 | 5.7% | AUSTRIA | 1 | 0.2% |
| France | 29 | 5.4% | Belgium | 14 | 2.6% |
| Sweden | 21 | 3.9% | BELGIUM | 2 | 0.4% |
| Netherlands | 18 | 3.3% | belgium | 1 | 0.2% |
| Austria | 15 | 2.8% | Brazil | 1 | 0.2% |
| Belgium | 14 | 2.6% | Denmark | 13 | 2.4% |
| Denmark | 13 | 2.4% | England | 3 | 0.6% |
| UK | 13 | 2.4% | Finland | 2 | 0.4% |
| SWEDEN | 10 | 1.8% | France | 29 | 5.4% |
| New Zealand | 9 | 1.7% | FRANCE | 4 | 0.7% |
| Norway | 9 | 1.7% | france | 1 | 0.2% |
| Italy | 8 | 1.5% | Germany | 86 | 15.9% |
| GERMANY | 7 | 1.3% | GERMANY | 7 | 1.3% |
| Spain | 6 | 1.1% | Great Britain | 1 | 0.2% |
| Hong Kong | 5 | 0.9% | Holland | 1 | 0.2% |
| FRANCE | 4 | 0.7% | HOLLAND | 1 | 0.2% |
| England | 3 | 0.6% | Hong Kong | 5 | 0.9% |
| ITALY | 3 | 0.6% | HONG KONG | 1 | 0.2% |
| JAPAN | 3 | 0.6% | Ireland | 2 | 0.4% |
| NETHERLANDS | 3 | 0.6% | Israel | 2 | 0.4% |
| NORWAY | 3 | 0.6% | Italy | 8 | 1.5% |
| Republic of Singapore | 3 | 0.6% | ITALY | 3 | 0.6% |
| The Netherlands | 3 | 0.6% | Japan | 64 | 11.8% |
| BELGIUM | 2 | 0.4% | JAPAN | 3 | 0.6% |
| Finland | 2 | 0.4% | Latvia | 1 | 0.2% |
| Ireland | 2 | 0.4% | MALAYSIA | 1 | 0.2% |
| Israel | 2 | 0.4% | N. Ireland | 1 | 0.2% |
| Philippines | 2 | 0.4% | Netherlands | 18 | 3.3% |
| Singapore | 2 | 0.4% | NETHERLANDS | 3 | 0.6% |
| ARGENTINA | 1 | 0.2% | New Zealand | 9 | 1.7% |
| Argentina | 1 | 0.2% | NEW ZEALAND | 1 | 0.2% |
| AUSTRIA | 1 | 0.2% | Norway | 9 | 1.7% |
| belgium | 1 | 0.2% | NORWAY | 3 | 0.6% |
| Brazil | 1 | 0.2% | Philippines | 2 | 0.4% |
| france | 1 | 0.2% | Portugal | 1 | 0.2% |
| Great Britain | 1 | 0.2% | Republic of Singapore | 3 | 0.6% |
| Holland | 1 | 0.2% | Singapore | 2 | 0.4% |
| HOLLAND | 1 | 0.2% | South Africa | 1 | 0.2% |
| HONG KONG | 1 | 0.2% | Spain | 6 | 1.1% |
| Latvia | 1 | 0.2% | Sweden | 21 | 3.9% |
| MALAYSIA | 1 | 0.2% | SWEDEN | 10 | 1.8% |
| N. Ireland | 1 | 0.2% | Switzerland | 31 | 5.7% |
| NEW ZEALAND | 1 | 0.2% | SWITZERLAND | 1 | 0.2% |
| Portugal | 1 | 0.2% | switzerland | 1 | 0.2% |
| South Africa | 1 | 0.2% | TAIWAN | 1 | 0.2% |
| switzerland | 1 | 0.2% | THAILAND | 1 | 0.2% |
| SWITZERLAND | 1 | 0.2% | The Netherlands | 3 | 0.6% |
| TAIWAN | 1 | 0.2% | UK | 13 | 2.4% |
| THAILAND | 1 | 0.2% | uk | 1 | 0.2% |
| uk | 1 | 0.2% | United Kingdom | 89 | 16.5% |
The answer: too long!
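For reference, a generic frequency distribution of this kind takes only a few lines to produce, which is part of why it says so little. The sketch below is a minimal illustration using Python's standard library; the function name and the sample values are made up for the example.

```python
from collections import Counter


def frequency_distribution(values):
    """Return (value, count, percent) rows sorted by frequency and by name."""
    counts = Counter(values)
    total = sum(counts.values())
    by_frequency = [
        (value, n, round(100 * n / total, 1))
        for value, n in counts.most_common()
    ]
    # Case-insensitive name order keeps "Germany" and "GERMANY" next to each other.
    by_name = sorted(by_frequency, key=lambda row: row[0].casefold())
    return by_frequency, by_name


# Made-up sample: every spelling is counted as its own distinct value.
sample = ["Germany"] * 3 + ["GERMANY"] + ["UK"] * 2 + ["United Kingdom"] * 4
by_frequency, by_name = frequency_distribution(sample)
for value, n, pct in by_frequency:
    print(f"{value:<16}{n:>4}{pct:>7}%")
```

Because the code has no knowledge of what a country is, "Germany" and "GERMANY" (or "UK" and "United Kingdom") are reported as unrelated values, and reconciling them is left to the reader.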
(Many of the DQ Now features described here are still in beta testing. Current features are listed on the products page.)
DQ Now uses domain knowledge to profile the data, showing a much more useful picture. The overview report is simple:
DQ Now Profile Summary
541 records
29 countries represented
26 values were corrected
60 values were standardized to mixed case
And, most importantly, zero items require the data analyst's attention! We've taken a task that requires several minutes of detail work, and reduced it to the few seconds required to skim our summary report. That's productivity! If there were any remaining problems (e.g. an unrecognized value in the country field), they would be clearly identified rather than remaining hidden in a generic distribution report.
How did we do it? There are two classes of problems: those that the cleansing engine will fix (using current settings) and those it won't. Other profiling tools do not separate these two cases, so the data analyst is forced to deal with both in the same profile -- with no way of knowing which problems fall into which category. To use an analogy from another field: a conventional data profile is full of noise (issues that don't require attention) so it's very difficult to discern the signal (issues that the cleansing engine won't fix on its own).
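The idea of separating fixable problems from those that need attention can be illustrated with a small sketch. It assumes a hypothetical lookup of canonical country names and known alternatives (only a handful of entries are shown) and is not DQ Now's actual implementation.

```python
from collections import Counter

# Hypothetical domain knowledge for a country field; a real list would be far larger.
CANONICAL = {
    name.casefold(): name
    for name in ("Germany", "Netherlands", "Singapore", "United Kingdom")
}
ALTERNATIVES = {
    "uk": ("United Kingdom", "standard abbreviation"),
    "england": ("United Kingdom", "common alternative"),
    "holland": ("Netherlands", "common alternative"),
    "republic of singapore": ("Singapore", "common alternative"),
}


def profile(values):
    """Split raw values into automatically fixable changes and items needing review."""
    cleaned = Counter()          # distribution after cleansing
    corrected = Counter()        # (old value, new value, rule) -> count
    needs_attention = Counter()  # values the rules do not recognize
    recased = 0                  # canonical names that only needed a case change
    for raw in values:
        key = raw.casefold()
        if key in CANONICAL:
            name = CANONICAL[key]
            if raw != name:
                recased += 1
            cleaned[name] += 1
        elif key in ALTERNATIVES:
            name, rule = ALTERNATIVES[key]
            corrected[(raw, name, rule)] += 1
            cleaned[name] += 1
        else:
            needs_attention[raw] += 1
    return {
        "cleaned": cleaned,
        "corrected": corrected,
        "recased": recased,
        "needs_attention": needs_attention,
    }


report = profile(["Germany", "GERMANY", "UK", "Holland", "Atlantis"])
print(report["needs_attention"])
```

Only the `needs_attention` bucket is signal for the analyst; everything in `corrected` and `recased` is noise that the cleansing rules already handle.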
If the data analyst wants to validate the corrections, they can zoom into a detail report to get all the information at a glance:
DQ Now Profile Detail
26 values were corrected (details below)
60 values were standardized to mixed case

| Count | New Value | Count | Old Value | Correction Rule |
|---|---|---|---|---|
| 18 | United Kingdom | | | |
| | | 13 | UK | standard abbreviation |
| | | 3 | England | common alternative |
| | | 2 | Great Britain | common alternative |
| 5 | Netherlands | | | |
| | | 3 | The Netherlands | common alternative |
| | | 2 | Holland | common alternative |
| 3 | Singapore | | | |
| | | 3 | Republic of Singapore | common alternative |
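A detail report like this is a grouping of individual corrections by their new value. The sketch below shows one way to build it; the record layout and function name are assumptions for illustration, while the counts are copied from the detail report above.

```python
from collections import defaultdict

# (old value, new value, correction rule, count), as listed in the detail report.
CORRECTIONS = [
    ("UK", "United Kingdom", "standard abbreviation", 13),
    ("England", "United Kingdom", "common alternative", 3),
    ("Great Britain", "United Kingdom", "common alternative", 2),
    ("The Netherlands", "Netherlands", "common alternative", 3),
    ("Holland", "Netherlands", "common alternative", 2),
    ("Republic of Singapore", "Singapore", "common alternative", 3),
]


def group_corrections(corrections):
    """Group corrections by new value; return (total, new value, detail rows)."""
    groups = defaultdict(list)
    for old, new, rule, count in corrections:
        groups[new].append((count, old, rule))
    report = [
        (sum(count for count, _, _ in rows), new, rows)
        for new, rows in groups.items()
    ]
    # Largest groups first, matching the order of the report.
    report.sort(key=lambda group: group[0], reverse=True)
    return report


detail = group_corrections(CORRECTIONS)
for total, new, rows in detail:
    print(total, new)
    for count, old, rule in rows:
        print("   ", count, old, "-", rule)
```

The group totals add up to the 26 corrected values reported in the summary.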
Once DQ Now has applied domain knowledge and found that no values require user intervention, a frequency distribution is no longer needed for understanding data quality. (It may still help the marketing department understand where customers are, but that's a separate issue.) Nevertheless, DQ Now is happy to create one. Notice how much simpler it is:
DQ Now Frequency Distribution
541 records
29 countries represented

| Value (sorted by frequency) | Count | Percent | Value (sorted by name) | Count | Percent |
|---|---|---|---|---|---|
| United Kingdom | 107 | 19.8% | Argentina | 2 | 0.4% |
| Germany | 93 | 17.2% | Australia | 36 | 6.7% |
| Japan | 67 | 12.4% | Austria | 16 | 3.0% |
| Australia | 36 | 6.7% | Belgium | 18 | 3.3% |
| France | 34 | 6.3% | Brazil | 1 | 0.2% |
| Switzerland | 33 | 6.1% | Denmark | 13 | 2.4% |
| Sweden | 31 | 5.7% | Finland | 2 | 0.4% |
| Netherlands | 26 | 4.8% | France | 34 | 6.3% |
| Belgium | 18 | 3.3% | Germany | 93 | 17.2% |
| Austria | 16 | 3.0% | Hong Kong | 6 | 1.1% |
| Denmark | 13 | 2.4% | Ireland | 2 | 0.4% |
| Norway | 12 | 2.2% | Israel | 2 | 0.4% |
| Italy | 11 | 2.0% | Italy | 11 | 2.0% |
| New Zealand | 10 | 1.8% | Japan | 67 | 12.4% |
| Hong Kong | 6 | 1.1% | Latvia | 1 | 0.2% |
| Spain | 6 | 1.1% | Malaysia | 1 | 0.2% |
| Singapore | 5 | 0.9% | Netherlands | 26 | 4.8% |
| Argentina | 2 | 0.4% | New Zealand | 10 | 1.8% |
| Finland | 2 | 0.4% | Norway | 12 | 2.2% |
| Ireland | 2 | 0.4% | Philippines | 2 | 0.4% |
| Israel | 2 | 0.4% | Portugal | 1 | 0.2% |
| Philippines | 2 | 0.4% | Singapore | 5 | 0.9% |
| Brazil | 1 | 0.2% | South Africa | 1 | 0.2% |
| Latvia | 1 | 0.2% | Spain | 6 | 1.1% |
| Malaysia | 1 | 0.2% | Sweden | 31 | 5.7% |
| Portugal | 1 | 0.2% | Switzerland | 33 | 6.1% |
| South Africa | 1 | 0.2% | Taiwan | 1 | 0.2% |
| Taiwan | 1 | 0.2% | Thailand | 1 | 0.2% |
| Thailand | 1 | 0.2% | United Kingdom | 107 | 19.8% |
Next step: see how DQ Now can be used instead of, or in addition to, related products.