Understanding Data Profiling
Data profiling is a form of statistical analysis used to inspect data sets and determine their inherent quality. A profile is a summary of the characteristics of the object under inspection, in this case, the data set.
What exactly is data profiling?
Data profiling is an essential technique in the larger process of data integration within a company’s data management. To properly integrate data, an accurate summary of data sets must be made. Wisdom states that there is a fair chance that the actual data structures and content will differ from what is assumed already in place. But even the smallest discrepancies between data sets are not wanted. Fortunately, data profiling and analysis can reveal if a data integration is easy, hard, or impossible by shedding light on just how big the discrepancies are.
In context, a data set comes from a source, so data profiling needs to originate at understanding the data source and what the organization expects of it (which may not be what it produces). An analyst will use a data profiling engine to generate statistics about a particular data set that are used to identify patterns along three general dimensions, its structure, content, and relationships, that can be interpreted by the analyst through the lens of quality. Statistics provide a quick and measurable way to determine data quality. Comparing data expectations and the data profile, will reveal gaps between the two.
It is the discrepancies between data expectations and the data profile that illustrate the quality of the data under inspection. A query of customer data may come with the expectation that 100% of phone number fields are completed, but after profiling it may be discovered to be far less. If this data is critical to the business, as customer data usually is, the analyst has uncovered a good candidate for data quality improvement. (Empty phone number fields need to be completed, and erroneous customer entries must be purged.)
Useful statistics include in a data profile summary:
1. Count of Null Values — Null values are a negative from a normalization perspective. Since null values equal nothing they have little general value and contradict the principle of ensuring all data is valid and valuable.
2. Max and Min Values — Limiting a value set range also helps to determine outliers.
3. Max and Min Lengths — Similarly, max and min limits on length helps to determine outliers and inconsistencies.
4. Data Types and Formats — Profiling data types and formats reveals the level of data non-conformance, and may also identify unexpected formats.
5. Frequency Distribution of Values — Analysts can use frequency distributions for individual columns to assess the reasonability of underlying data.
Data profiling extends into analyzing the relationships between data by examining overlaps between columns and tables.
- Cross-Column Analysis — This analysis compares columns in a data set to identify any duplicates, overlap, or reveal embedded value dependencies.
- Inter-Table Analysis — Likewise, inter-table analysis identifies overlap between tables, and foreign key relationships.
In a way data profiling determines the “fitness” of data by comparing it against documented business rules, and how well the data conforms to those rules (business rules can often be reversed engineered from data sets too). Further, provided that there is a definitive set of values determined to be true that can be used for comparison, like a standard, even assessing data accuracy can be done during profiling through content analysis.
Data profiling is a beginning step of data quality analysis. Profiling establishes a foundation to explore solving potential problems. Combined with other analytics, like business processes analysis, efforts can target root causes of persistent poor data quality, and remedy them, improving data quality over time.
Data mining vs. data profiling
Both data mining and data profiling use analysis of data sets to achieve their goals. Data profiling, as stated above, is used to determine the characteristics of a data set, e.g. the range of values in a column, and the relationships between sets.
Data mining sounds similar because it is about finding patterns and correlations within big data sets to discover probable trends and predict outcomes. And while pattern finding is involved in both, data mining deploys artificial intelligence and machine learning to sift through vast amounts of data and may even tap into unstructured data. Mining is akin to seeing the crowd for the people and the patterns in it; profiling contrastingly would focus on the individuals themselves to determine their personality characteristics.
Data profiling is used to determine the fitness of data before integration, and data mining is used to dig deep into that data to make it valuable by uncovering actionable business insights.
What are the types of data profiling?
Data profiling focuses on three primary areas for analysis, structural discovery, content discovery, and relationship discovery. These three views together form a comprehensive summary profile and accurately summarize the characteristics of data and the data set.
Structure discovery
Analysts use structure discovery in data profiling to construct the basic statistics of a data set. This is the shape of the data set by summarizing common attributes: column min/max values, column lengths of data, column data types and formats. In this stage, the emphasis is on ensuring that the structures are consistent between data sets prepared for integration rather than what the data is inside. Ideally, the easiest way to integrate data should be one-to-one, using the same exact formats and types, like transferring a file across desktop computers running the same file system.
Relationship discovery
Improper data relationships have the power to break data systems. Relationship discovery analyzes the intra-relationships between columns in a data set, as well as the inter-relations between associated tables. This is important for discovering data overlaps or uncovering connections that may break normalization rules.
Content discovery
Content discovery peers into the data itself. By diving deeper into the actual content, analysts can generate frequency distribution statistics, they can understand the percentage of fields that conform to standards, and they can infer the validity, completeness, or ambiguity of data. Content discovery’s emphasis is on the quality of the data, its standardization, and proper formatting.
What is the data profiling process?
Data profiling is an upstream data integration process intent on identifying and fixing data quality issues before further data analysis, for example, in downstream data mining operations. An efficient way of running profiles is to follow the following three general steps, and to use an MDM platform equipped with profiling tools.
- Collect multiple data sources and associated metadata.
- Cleanse data to eliminate duplicates, find anomalies, unify the structure, and discover interrelationships.
- Run statistical reports to uncover structural, content, and relationships information.
Data profiling reports help to form the foundation of Data Quality Reporting.
Develop Data Quality Reporting
Assessing data quality and further resolving data issues and fixing contributors to poor data cannot be achieved in isolation. By developing data quality reporting mechanisms that can be shared with data consumers, they can understand and contribute to improving and organizations data, and subsequently improving the business overall. Reporting should focus on:
- Data Quality Scorecard
- Data Quality Trends
- SLA Metrics
- Data Quality Issues Management
- Data Quality Team Performance against Governance Policies
- IT Team Performance against Data Quality Policies
- Improvement Results Positive and Negative
Data profiling tools
Because it’s the first critical step in ensuring data quality, data quality software incorporates the data profiling tools used to help uncover data quality issues, like duplications, inconsistencies, inaccuracies, and incompleteness. Furthermore, data profiling functionality may also be found in master data management software, data integration software, and big data software. Additionally, open-source profiling tools exist that can provide cost effective tools for bootstrapped companies.