What is Data Standardization?
Data standardization is the data quality process of transforming data to fit a predefined and constrained set of values, relying on the power of uniformity to improve data efficiencies. Closely related is data cleansing, a process to eject bad data by eliminating erroneous entries, duplicates, and bad values, but also by aligning data sets with data standards and data rules to make data more contextual.
Data standardization is a key technique, and a special case of data transformation, used on data to improve its quality in order to provide acceptable input for downstream data operations.
Entity resolution is one notable downstream operation which requires data to conform to standards in order to match relevant entities from multiple disparate data sources using their data fields. Without a standard of reliable comparison, integrating data between data systems would lead to data inconsistencies.
What is the challenge of data standardization?
The challenge that data standardization is confronted with is when analyzing data or merging data from different sources. When merging, sources will generally not have recorded data in the same way, with the same names, or in the same structural configurations. Meaning, while a customer’s name may be a standard data point, there can be many ways for a data handler to retain that data: is it stored as one data field, or multiple? Or, does it allow for titles, honorifics, and monikers?
Two other commonly cited examples that have been standardized are phone numbers, and addresses. Phone numbers adhere to standards such as area codes, country codes, local numbers, and formatting like hyphenations. Addresses are standardized using elements like postal codes, street abbreviations, and address numbers. In fact, addresses are standardized at the national level. In the US the United States Thoroughfare, Landmark, and Postal Address Data Standard has codified and documented address content, address classifications, data quality and exchange standards so that entities operating in the US have a uniform location that everyone understands.
What are standards in data standardization?
The United States Thoroughfare, Landmark, and Postal Address Data Standard illustrates a data standard for addresses in the US; an attempt to have a highly uniform, accurate, and complete list to bring users in alignment. Because it fits these qualities to a high degree, it's used by the public at large and by states. Therefore, standards can be seen as benchmarks of quality and correctness.
Specifically, a standard is a set of codified assertions that a company establishes and then expects as the reliable default format. These codes must be enforced on data through the use of established data handling rules and data tools. Rules can be set to check data entry, like enforcing the correct state code instead of allowing typed input, or rules can be applied at later times like for batch processing to standardized data in bulk.
At a high level, say analyzing a customer contact information data set for completeness, an overall standard of 100% completion may be set. Setting a strict standard like this may go to support marketing and sales efforts by ensuring that every customer’s contact information is known and they can be reliably outreached to. This then makes the connection on how data quality impacts the business’s bottom line. Poor data quality leads to poor marketing and sales results, and therefore reduced bottom line.
Standards can dictate the attributes of data fields themselves; they can govern the relationships certain entities can have with others; and they can restrict values allowed in a format. Even today, AI and machine learning are being used to algorithmically understand patterns within data and establish the inherent rules. As a company understands its data over time, the standard can be improved. These rules are coded and enforced through the use of Master Data Management platforms.
Why is data standardization important?
Data standardization is necessary in order to improve the process of entity resolution that is critical to Master Data Management. Only through the process of matching and merging records from disparate datasets can the construction of the Master Data set be possible. Without standardized data there is no reliable comparative understanding between data sets. And Without entity resolution, there is no reliable tie between entries in separate databases.
Standardization vs. normalization
Data standardization and data normalization both intend to improve data quality but do so by different approaches. Data standardization enforces rules and definitions on data, whereas database normalization is designing a database structure to reduce data redundancy and improve data integrity. Because this means creating tables and relationships in ways that eliminate redundancy and inconsistencies, normalizing a database after data entry is not advisable as it can further cause integrity issues. This means that normalization is typically first considered during the design phase of database development.
Normalization contributes to the elimination of database designs that produce modification anomalies. When a database design that breaks normalization rules is modified, either by insertion, deletion or update, certain side-effects may occur due to its structure.
- An insertion anomaly can occur in circumstances where data entry cannot be recorded because it is simply in violation of the database structure. Such as, at the point of data insertion, data is missing, but the table requires complete data, then the insertion fails.
- An update anomaly occurs in circumstances when there may be multiple records for the same entity. If data is updated, it must be updated in all congruent entries otherwise the system will soon add its own inconsistencies into its data.
- A deletion anomaly can occur when dependencies are not normalized. For instance, in a database where courses are attached to teachers, unassigning a teacher mid-semester from a course could very well delete those course entries from the database and subsequently remove those courses from student curriculums.
Normalization is the process of structuring a database using “normal forms” that enforce an internal logic which protects against losing data integrity. Normal forms are progress design guidelines, meaning that each normalization level builds upon the previous normal forms, enforcing stricter data integrity and reinforcing good data standards.
An unnormalized database is labeled UNF. Applying the first normal form, or 1NF, to a database eliminates nesting relations into tables, and instead replaces it with the use of foreign keys to increase flexibility and data independence. After establishing 1NF, the second normal form, or 2NF, can be applied, which demands that non-prime attributes depend only on candidate keys. The third normal form, or 3NF, is achieved when all attributes are functionally dependent solely on the primary key. At this point, there are further normal forms 4NF, 5NF, and 6NF, but applying up to the third normal form on database designs categorizes it as sufficiently ‘normalized’, and adequately free from any modification anomalies. Other normal forms exist in special or academic uses.
How to standardize data
Today, solutions exist that simplify the process of standardizing data. Through an MDM platform and other data tools, data sources can be acquired and profiled, then business rules can be applied to source data during standardization processes to prepare data for integration. Three main categories of business rules are used:
Taxonomy Rules — Taxonomy rules help to enforce a hierarchy to data, eliminating invalid entries outside the data value range. A product may only have three color options to choose from, and so no other values should be present within that field other than those three options.
Reshape Rules — Reshaping rules reconfigure incoming data into useful structures in the MDM system. If a source decides that it only wants to send a compilation of its data rather than separate data tables, then reshaping rules can reconfigure the table into something more suitable to the aggregating system. For example, the compilation data set may contain the information for three different tables in the target system, reshape rules would then convert the source data into three target tables.
Semantic Rules — Semantic rules peer into the data itself to describe the domain of that data. This is necessary because operational context is unique to each business, and so providers simply use their own definitions without considering that context. The job of standardization during data aggregation is to turn that data into something that makes sense for the business.
What are examples of data cleansing?
Data standardization is a special case of data transformation which maps data values from their source into new values for the target.
Data transformation is a fundamental aspect of data management and integration. In the context of Master Data Management, data standardization is data transformations that follow rules that dictate how data should be transformed into useful formats.
For example, standardizing telephone numbers to the standard telephone number pattern in the United States (country code then area code then local number) helps to cleanse these data fields of erroneous entries.
The following table demonstrates data that has been standardized to the standard U.S. telephone pattern.
|Unclean Phone Number Data||Phone Numbers After Standardization|
|123 456 7890||123-456-7890|
|123 456 789 0||123-456-7890|
|12 34 56 789 0||123-456-7890|
|Call me 1234567890||123-456-7890|
|+1 123 456 7890||(+1) 123-456-7890|
|(91) 891 123 456 7||(+91) 891-123-4567|
|(91) (891) 123 4567||(+91) 891-123-4567|
|Call me 12345||000-001-2345|
|My mobile is||Skip|
In this simple example, the left column, “Unclean Phone Number Data”, representing data to be imported into the company data systems must first undergo a standardization process. While the output in the right column seems simple to arrive at, realistically the system would need to first cleanse these entries, standardize the data within fields, then validate if the content is truthful before it would be useful to the company.