Data profiling is an essential technique in the larger process of data integration within a company's data management. To integrate data properly, an accurate summary of the data sets must be made. Conventional wisdom holds that there is a fair chance the actual data structures and content will differ from what is assumed to be in place, and even small discrepancies between data sets are unwelcome. Fortunately, data profiling and analysis can reveal whether a data integration will be easy, hard, or impossible by shedding light on just how big the discrepancies are.
A data set always comes from a source, so data profiling needs to start with understanding that source and what the organization expects of it (which may not be what it produces). An analyst uses a data profiling engine to generate statistics about a particular data set. These statistics identify patterns along three general dimensions (structure, content, and relationships), which the analyst then interprets through the lens of quality. Statistics provide a quick, measurable way to assess data quality, and comparing data expectations with the data profile reveals the gaps between the two.
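As a rough illustration of the kind of statistics a profiling engine might generate, the sketch below profiles a single column in plain Python. It covers only the content dimension; the function name `profile_column`, the choice of statistics, and the sample values are assumptions made for this example, not the behavior of any particular profiling product.

```python
from collections import Counter


def profile_column(values):
    """Return basic content statistics for one column (a list of values)."""
    # Treat None and blank strings as missing (an assumption of this sketch).
    present = [v for v in values if v is not None and str(v).strip() != ""]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0] if counts else None,
    }


# Hypothetical sample column, made up for illustration.
country = ["US", "US", "DE", None, "", "FR", "US"]
stats = profile_column(country)
print(stats)
```

A real engine would typically compute many more measures per column and also look across columns and tables to cover structure and relationships.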
It is the discrepancies between data expectations and the data profile that indicate the quality of the data under inspection. A query of customer data may come with the expectation that 100% of phone number fields are completed, but profiling may reveal that far fewer are. If this data is critical to the business, as customer data usually is, the analyst has uncovered a good candidate for data quality improvement. (Empty phone number fields need to be completed, and erroneous customer entries must be purged.)
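The phone number example can be sketched as a simple comparison between an expected completeness rate and the rate actually measured. The `completeness` helper, the record layout, and the customer records are all hypothetical, chosen only to show how a gap between expectation and profile might be quantified.

```python
def completeness(records, field):
    """Fraction of records whose field holds a non-blank string."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if (r.get(field) or "").strip())
    return filled / len(records)


# Hypothetical customer records, made up for illustration.
customers = [
    {"name": "Ada", "phone": "555-0100"},
    {"name": "Ben", "phone": ""},
    {"name": "Cy", "phone": None},
    {"name": "Di", "phone": "555-0101"},
]

expected = 1.0  # the expectation: every phone number field is completed
actual = completeness(customers, "phone")
gap = expected - actual
print(f"expected {expected:.0%}, actual {actual:.0%}, gap {gap:.0%}")
```

A non-zero gap on a business-critical field is the signal that marks the field as a candidate for data quality improvement.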
Useful statistics to include in a data profile summary: