What is Data Integrity?
Data integrity is the assurance that data meets quality standards at all times: that it is accurate, complete, consistent, and valid. Data consumers interact with data on the assumption that it is reliable. Many go further and treat the data they use as immutable and truthful, trusting it enough to base life decisions on it. It is therefore important to ensure that data really is reliable.
Accurate data represents the underlying facts faithfully. Complete data remains in its full form, unfiltered. Consistent data remains the same no matter how it is accessed or stored (data can of course be transformed to improve its quality, but data consumers should have access to the same consistent data throughout the system). Valid data has been verified as representing reality; for example, phone numbers reach real phones, and email addresses are real and do not bounce.
In short, data integrity is the state of data being whole and protected against improper alteration.
Why is data integrity important?
Today, the data landscape is enormous. Data permeates all aspects of our lives, and companies that understand this and turn the data they create into actionable business insights are a step ahead. Data integrity is important because this deluge of data carries a great deal of noise. The noise must be filtered out to make the data useful. Once the data has been cleaned, combined, and transformed into a format that is usable for analysis and insights, its integrity must be maintained so it can continue to be a highly valued and productive asset.
What are the types of data integrity?
Safeguarding data integrity comprehensively requires a holistic view of how data is kept within an organization. Much of the emphasis falls on managing data sets and data access, but to ensure data remains accurate, complete, consistent, and secure, where and how data is physically stored must also be considered. Broadly, there are two types to consider: physical data integrity and logical data integrity.
Maintaining data integrity begins with protecting and securing the physical infrastructure where data lives. This generally means taking protective measures over memory and storage hardware, but as more data services are operated from the cloud, it becomes less of an in-house responsibility. In many cases cloud providers offer data services superior to what a small IT team can deliver, and companies can save considerably in both cost and time by judiciously shopping for cloud data services. In other cases, cloud providers can augment the data strategy of enterprises that need to diversify their global data infrastructure.
In either case, the providers assume responsibility for ensuring data integrity by protecting physical infrastructure from multiple threats:
- Infrastructure and hardware faults and failures
- Design failures
- Environmental impacts that cause equipment deterioration
- Power outages and disruptions
- Natural disasters
- Environmental extremes
Some key techniques that cloud providers and enterprises may design into their systems to ensure infrastructure protection and thereby data integrity include:
- RAID or other redundant storage systems with battery protection
- Error-correcting (ECC) memory
- Cluster and distributed file systems
- Error detection algorithms to protect data in transit
- Generator back-ups
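As a minimal sketch of the error-detection item in the list above, the Python snippet below computes a CRC-32 checksum before a payload is sent and recomputes it on receipt. The function names and the sample payload are illustrative assumptions, not part of any particular provider's system; real storage and network stacks perform comparable checks at lower layers.

```python
import zlib


def crc32_of(payload: bytes) -> int:
    """Return the CRC-32 checksum of a payload as an unsigned 32-bit integer."""
    return zlib.crc32(payload) & 0xFFFFFFFF


def verify(payload: bytes, expected_crc: int) -> bool:
    """Recompute the checksum on receipt and compare it with the one sent."""
    return crc32_of(payload) == expected_crc


# Sender side: compute the checksum and ship it alongside the data.
original = b"customer_id=42,balance=100.00"
checksum = crc32_of(original)

# Hypothetical corruption in transit: one byte changed ("1" became "9").
corrupted = b"customer_id=42,balance=900.00"

print(verify(original, checksum))   # True
print(verify(corrupted, checksum))  # False
```

A CRC is designed to catch accidental corruption. It does not protect against deliberate tampering, which calls for a cryptographic hash or message authentication code instead.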
Redundancy is clearly a key principle for the physical protection of data, and assuming a perfect physical system, we can turn to the logical aspect of data integrity.
Logical data integrity is concerned with the data itself: whether the data makes sense given business context and trends, and whether adjustments must be made because assumptions have changed. In essence, it is concerned with how well the data fits real-life needs. At a technical level, logical integrity rests on four main principles.
- Entity integrity — establishing rules so that each record is uniquely identifiable, with no duplicate data elements and no missing critical fields.
- Referential integrity — establishing rules for how related data is stored and used, so that references between records (for example, foreign keys) always point to records that exist, and changes or deletions cannot leave orphaned data.
- Domain integrity — ensuring that formats, types, value ranges and other attributes fall within acceptable parameters.
- User-defined integrity — additional rules, such as business rules, that help to further constrain the data to the needs of the organization.
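The four principles above map onto ordinary database constraints. The following is a minimal sketch using Python's built-in sqlite3 module; the table names, columns, and the status rule are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,      -- entity integrity: unique row identifier
    email       TEXT NOT NULL UNIQUE      -- entity integrity: required, no duplicates
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
                REFERENCES customers(customer_id),   -- referential integrity
    quantity    INTEGER NOT NULL
                CHECK (quantity > 0),                -- domain integrity: value range
    status      TEXT NOT NULL
                CHECK (status IN ('new', 'paid', 'shipped'))  -- hypothetical business rule
);
""")


def try_insert(sql, params):
    """Attempt one insert; report whether a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


ADD_CUSTOMER = "INSERT INTO customers (customer_id, email) VALUES (?, ?)"
ADD_ORDER = "INSERT INTO orders (customer_id, quantity, status) VALUES (?, ?, ?)"

results = {
    "valid customer": try_insert(ADD_CUSTOMER, (1, "ana@example.com")),
    "duplicate email": try_insert(ADD_CUSTOMER, (2, "ana@example.com")),
    "valid order": try_insert(ADD_ORDER, (1, 3, "new")),
    "unknown customer": try_insert(ADD_ORDER, (99, 1, "new")),
    "zero quantity": try_insert(ADD_ORDER, (1, 0, "new")),
    "unknown status": try_insert(ADD_ORDER, (1, 1, "lost")),
}

for label, outcome in results.items():
    print(f"{label}: {outcome}")
```

Declaring the rules in the schema means the database itself rejects bad rows, regardless of which application or user attempts the write.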
How is data integrity ensured?
Ensuring data integrity requires a framework that takes into account not only infrastructure protection and logical data rules but also the human side of the organization. The following broad guidelines can help organizations maintain strong data integrity.
- Employee Training — Employees must be kept informed about their data responsibilities. Companies that implement data governance well will have established a data governance steering committee, a data governance council, and data stewards responsible for specific data and processes.
- Establish a Data Integrity First Culture — With employee training in place, the company is on its way to a Data Integrity First Culture: a data-first culture that values data as an asset. With the right buy-in, data is treated as the asset it can be, employees look for ways to use it, and the company steadily improves its understanding of its data. Without buy-in, data efforts will flounder.
- Validate Data — Data validation is non-negotiable. All data should be validated for acceptability before it enters the system. Integration scenarios are becoming more common as more businesses buy data from third parties, such as performance marketers who may be paid per email address they deliver. Ensuring these email addresses are valid helps to reduce payouts for false addresses while building a reliable email database.
- Reasonably Cleansed Data — Preprocessing helps to remove duplicate entries and other errors. However, any cleansing and standardization should leave the data set more reliable and useful than it was before.
- Back-up and Protect Data — Periodic back-ups, especially to multiple locations, are standard practice for protecting data integrity. Top-tier cloud services perform back-ups and data protection automatically.
- Implement Security — Cloud providers implement the accepted industry security features, such as encryption, authentication and authorization, and access logs, and enterprises should use them to secure their data.
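The validation and cleansing guidelines above can be sketched in a few lines of Python. This is an illustrative example only: the pattern, function names, and sample addresses are assumptions, and a format check of this kind cannot confirm that an address is deliverable, which would require a separate verification step.

```python
import re

# Deliberately simple pattern: exactly one "@", no whitespace, a dot in the domain.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def normalize_email(raw: str) -> str:
    """Standardize an address by trimming whitespace and lowercasing it."""
    return raw.strip().lower()


def validate_and_dedupe(raw_emails):
    """Split incoming addresses into accepted (valid, unique) and rejected lists."""
    accepted, rejected, seen = [], [], set()
    for raw in raw_emails:
        email = normalize_email(raw)
        if not EMAIL_PATTERN.match(email):
            rejected.append(raw)        # keep the original value for review
        elif email not in seen:
            seen.add(email)
            accepted.append(email)      # duplicates are silently dropped
    return accepted, rejected


incoming = ["Ana@Example.com ", "ana@example.com", "not-an-email", "bo@example.org"]
accepted, rejected = validate_and_dedupe(incoming)

print(accepted)  # ['ana@example.com', 'bo@example.org']
print(rejected)  # ['not-an-email']
```

Running the check at the point of entry keeps malformed and duplicate records out of the system rather than cleaning them up afterwards.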
Examples of data integrity issues
High data integrity results from establishing proper data management and governance practices. As in many other domains, people, processes, and technology all play a part. Improving each of these areas improves data integrity, and ultimately the insights drawn from that data can propel business decisions.
The following are seven common issues that hinder data integrity.
- Poor or Missing Data Integration Tools — There are data integration tools on the market for businesses of any size, and they make data handling far more manageable. Analyzing multiple data sources without these tools can quickly become cumbersome and error prone. Furthermore, adopting standard data integration tools can eliminate many of the data integrity issues discussed below.
- Manual Data Entry and Data Collection Processes — Manual processes introduce human error. Reduce errors by adding data validation, standardization, and other data checks. Where possible, automated data entry is much more reliable.
- Multiple Tools to Process and Analyze Data — Tools come and go as new application technologies emerge, so it is not unusual to accumulate many tools to support data operations. It is a good idea to audit these tools periodically for their usefulness, their conflicts with other tools, and the potential to consolidate them into newer, more capable data platforms.
- Poor Audit Records — Data adjustments and changes to data integrity rules must all be documented. Failing to keep a historical record of data changes leaves companies in the dark about their progress and will negatively impact operations and data analysis. Without a record to refer to, changes to data are made blind.
- Legacy Systems — Just as software outgrows hardware, data outgrows software systems and the infrastructure they run on. Legacy systems eventually fall behind, so it is wise to upgrade data infrastructure proactively, before switching becomes costly.
- Inadequate Data Training — Data is complex, and every stakeholder should have appropriate training to handle data properly. Adequate training ensures that all users understand the upstream and downstream needs and responsibilities of other data stakeholders, and that they know their own responsibilities in supporting data as a critical, strategic asset.
- Inadequate Data Security and Maintenance — Lapses in security and maintenance introduce process flaws that can corrupt data. Employees sharing the same user ID can introduce data errors by accident, and loose credentials can allow intruders to steal or delete data.
Some or all of the issues above can be addressed by data quality tools or Master Data Management platforms.
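To make the audit-record issue from the list above concrete, here is a minimal Python sketch of a store that logs every change along with who made it and when. The class and field names are hypothetical; dedicated platforms typically provide this kind of history out of the box.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    """One immutable record of a change: who changed what, when, from and to."""
    timestamp: str
    user: str
    key: str
    old_value: object
    new_value: object


class AuditedStore:
    """A key-value store that appends an audit entry on every write."""

    def __init__(self):
        self._data = {}
        self._log = []

    def set(self, user, key, value):
        old = self._data.get(key)  # None if the key did not exist yet
        self._data[key] = value
        stamp = datetime.now(timezone.utc).isoformat()
        self._log.append(AuditEntry(stamp, user, key, old, value))

    def get(self, key):
        return self._data[key]

    def history(self, key):
        """Return every recorded change for a key, oldest first."""
        return [entry for entry in self._log if entry.key == key]


store = AuditedStore()
store.set("alice", "unit_price", 10)
store.set("bob", "unit_price", 12)

for entry in store.history("unit_price"):
    print(entry.user, entry.old_value, "->", entry.new_value)
```

Because each entry carries the previous value, a questionable change can be traced to its author and reversed instead of being corrected blind.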