Data Validation: An Introductory Guide
Data validation is a technique used in Data Quality Management to improve a data’s quality by restricting the range of values in a data field to an acceptable set of values that may be entered. Data validation and data standardization differ. Standardizing data transforms it into a known and reliable format, which then can be used to verify the data content’s validity or truthfulness.
What is data validation?
Data validation is the technique of restricting the possible range of values of a data field in order to eliminate entry errors, inconsistencies, and to build context around a data field.
Validation rules applied during data (such as on electronic forms) entry can instantly determine if input data falls within the acceptable range of values and prevent erroneous values. For instance, when recording addresses, each US state has a standardized official code and abbreviation, which can be adopted as the acceptable range of values for state locations. Any value that falls outside of this range is not accepted, whereas any valid value is accepted as input, regardless if it is accurate. (The state code XY would be rejected, and AZ would be accepted because it is in the range of acceptable state abbreviations.)
Validation can be carried out further than simply cross-referencing for a valid input. A clever designer will realize that there are certain patterns that can be validated against, such as addresses which can be validated against zip codes. While a simple validation rule can restrict the range of state abbreviation, it can also open up inaccuracies—while an address may be in Hollywood, it is possible that an incorrect state abbreviation, one other than California (CA), may be inputted. By using zip codes, the designer can create a more accurate scope of validation by comparing it to a higher validation standard. That is because the US Postal Service knows which addresses fall within each zip code, an address is valid if it is in the right zip code. In fact many address validation tools and services exist simply to augment client data systems in this capacity.
Valid data is a key principle of Data Quality, and as the address example shows, validation can become a complex undertaking for which many techniques and tools have been developed to make it more manageable.
What are the rules of data validation?
Data validation is implemented to enforce a high level of data consistency within a data set. In achieving this end, general validation techniques are applied with increasing complexity and scope. Common data validation checks include:
- Data Type Validation — A check designed to validate primitive data fields containing values of the appropriate data type.
- Simple Range and Constraint Validation — A check designed to validate that fields are within a range of values, or follow an allowable character pattern, as with regular expression definitions.
- Code and Cross Reference Validation — A check against an external source, such as standards tables, a look-up table, or a directory information services.
- Structure Validation — A check designed for complex structures that take advantage of processing capabilities, such as testing for conditional constraints.
- Consistency Validation — A check to determine internal logic is not violated, such as placing start dates before end dates and not the other way.
As can be inferred, data formatting standards and consistency rules play a key role in setting up data to be successfully validated.
Rules for Consistency
Consistency rules are used to ensure that data adheres to logical considerations. In essence, these are rules to help enforce data integrity. Common examples of consistency include:
- Dates that follow chronological order
- Words that are spelled correctly
- Passwords that follow strong rules
In the first case, a rule could be written that stated the Start_Date field must not exceed the End_Date field. If this happens, the database will throw an error and ask for it to be corrected.
Formats must be standardized to ensure usability and compatibility when integrating data within and across systems. Formats are an arrangement or specification of encoding data for sharing or storage, for example file and data formats like CSV, JSON and XML.
How is data validated?
Data validation is ideally performed by scripts and tools rather than by manual processes which can continuously introduce human errors. The key is automating validation tools that can do the heavy lifting of checking huge volumes of data to ensure data integrity is in fact correct.
Data validation scripts
For those data scientists who are able to code, data validation scripts can prove to be effective and quick techniques to build automated processes to check data. Validation scripts and validation rules go hand in hand. The script contains the validation logic process by which a single condition is applied to a piece of data in evaluation of its validity. The check is binary, true or false. And multiple data checks can be validated: data type check, code check, range check, format check, consistency check, and uniqueness check.
Data validation tools and software
The alternative to developing data validation scripts is to use out-of-the-box tools and software, or cloud data services to validate and manage data. For many businesses without the expertise in data management, cloud services are an ideal choice.
Cloud Data Services come in many varieties, some offer just data management, others allow sophisticated data quality tools. Data management suites typically combine all the features that IT administrators and development & data teams need in their tasks.
Data Validation vs Data Verification
A point of distinction must be made between data validation and data verification. Data validation has been discussed, with the caveat that valid data may still be inaccurate, in that it may not represent the truth. For example, upon entering address data, valid state data includes all fifty states, which means there are potentially 49 state abbreviations that can be used as valid input, but be inaccurate.
After validating data, then verifying data can begin. By verifying data against reality, values can be modified to represent their trueness. A relevant example is to send a text to a newly recorded phone number, if the phone receives it and then the user replies, then the company has discovered it is both a real valid phone number and a verified user.