Understanding Entity Resolution

Entity resolution is the process of determining whether two data entries represent the same real-world object, which makes it a decision-making process. The decision is made at the level of individual entities, but it can be scaled to accommodate big data. Because so much depends on how each entity-level decision is made, there is significant room for proprietary approaches that differ in quality and speed.


What exactly is entity resolution?

Entity resolution is a key processing step in Master Data Management (MDM). Master Data acts as a superset of a company’s overall data, tying together data from disparate sources that potentially refer to the same unique entities. Entities such as customers, products, and suppliers can be represented in multiple databases and used by different departments, but they may not be represented with the same data structures or even the same names. Furthermore, data within those structures may not be formatted consistently, and this lack of standardization contributes to poor data integrity.


What is dynamic entity resolution?

In entity resolution, the process of matching different data points that could represent a single entity is called similarity analysis, and it’s a continually improving field. There are three common approaches to similarity analysis, in order of increasing complexity: traditional matching, which compares records directly but yields poor results; batch entity resolution, which produces better results by consolidating matches into a single view of entities; and real-time entity resolution, which keeps that single view current.

The next evolution of similarity analysis, referred to as dynamic entity resolution, emphasizes regenerating entity views from the underlying raw data in real time, according to the requirements of a specific use case. Like real-time entity resolution, it stays current; unlike it, each view is also tailored to the use case it serves, which keeps it relevant.

Some use cases require broader targeting, and others require tighter specificity. The premise of regenerating entities is therefore to allow different combinations of matching criteria for individual entities, instead of assuming that one set of criteria can fit all use cases. In essence, dynamic entity resolution allows different levels of fuzziness so that each data access and application requirement can be met. This is beneficial for enterprise-level data solutions that support multiple use cases.
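As a minimal sketch of the idea of different fuzziness levels, the same raw records can be regrouped under a different matching threshold for each use case. The use-case names, the threshold values, the `resolve_view` helper, and the sample name "Jacob Smyth" are all hypothetical, and Python's `difflib.SequenceMatcher` stands in for whatever similarity measure a real platform would use.

```python
from difflib import SequenceMatcher

# Hypothetical use cases and thresholds; the values are illustrative only.
USE_CASE_THRESHOLDS = {
    "compliance": 0.95,  # tight matching
    "marketing": 0.80,   # broader matching
}

def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def resolve_view(records, use_case):
    """Regenerate entity groups from the raw records for one use case."""
    threshold = USE_CASE_THRESHOLDS[use_case]
    groups = []
    for record in records:
        for group in groups:
            # Compare against the first member of each existing group.
            if similarity(record, group[0]) >= threshold:
                group.append(record)
                break
        else:
            groups.append([record])
    return groups

raw_names = ["Jacob Smith", "Jacob Smyth", "Joanna Smith"]
tight_view = resolve_view(raw_names, "compliance")
broad_view = resolve_view(raw_names, "marketing")
```

Under these assumed thresholds the tight view keeps all three names apart, while the broad view groups the two Jacob spellings together. Neither view is stored as the single truth; each is rebuilt from the raw data when it is needed.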

Why is entity resolution important?

Entity resolution is critical to Master Data Management. The Master Data set can only be constructed by matching and merging records from disparate datasets. Without entity resolution, there is no reliable tie between entries in separate databases, so the data is unlikely to be combined accurately and any insights that could be drawn from combining it are lost. In essence, the output of an entity resolution process is the Master Data record.


What is the process of entity resolution?

Entity resolution is one step within the larger Master Data Management processing model. The stages of that model overlap, and each one affects overall data quality, so the effectiveness of entity resolution should not be considered in isolation. A comprehensive MDM environment will include the following processing steps.

Data Model Management: Master Data is purpose-built to overcome the inconsistent data that leads to poor understanding. The solution is to establish clear and consistent logical data definitions within the context of the business, and then to make data systems use this shared language when they exchange data.

Another established method is to use a globally unique identifier (GUID) to represent each entity, so that reference data can be associated with the entity through its GUID. In this way the data model no longer depends on the naming conventions of any one system, a principle that should also extend down to the attributes that describe data within systems.
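The GUID approach can be illustrated with a small cross-reference registry. This is only a sketch: the `CrossReference` class, the system names "crm" and "billing", and the local ID "A-183" are hypothetical, and Python's standard `uuid` module is used to generate the identifiers.

```python
import uuid

class CrossReference:
    """Hypothetical registry linking source-system keys to one global ID."""

    def __init__(self):
        self._by_source = {}  # (system, local_id) -> GUID
        self._by_guid = {}    # GUID -> set of (system, local_id)

    def register(self, system, local_id, guid=None):
        """Attach a source key to a GUID, creating a new GUID if none is given."""
        key = (system, local_id)
        if key in self._by_source:
            return self._by_source[key]
        guid = guid or str(uuid.uuid4())
        self._by_source[key] = guid
        self._by_guid.setdefault(guid, set()).add(key)
        return guid

    def sources_for(self, guid):
        """List every source key known for a GUID, in sorted order."""
        return sorted(self._by_guid.get(guid, set()))

xref = CrossReference()
customer_guid = xref.register("crm", "549")
# A second system's record is resolved to the same entity.
xref.register("billing", "A-183", guid=customer_guid)
```

Each system keeps its own local key, and anything that needs the entity as a whole looks it up through the GUID instead.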

Data Acquisition: New data sources, and the data within them, may be inconsistent. Because of these external and internal inconsistencies, a reliable, repeatable data acquisition process is needed to effectively manage and improve the activities that support entity resolution, such as validating, standardizing, and enriching data.

Data Validation, Standardization and Enrichment: At a minimum, validation, standardization, and enrichment should be implemented to ensure good data consistency. Validation aims to eliminate erroneous data entries, like fake emails. Standardization conforms data to known values (like country codes), formats (like telephone numbers), and fields (like addresses). Enrichment adds useful attributes that aid in more accurate entity resolution. The result is cleansed data ready for entity resolution.
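A rough sketch of validation and standardization might look like the following. The function names, the small country lookup, and the deliberately loose email pattern are assumptions made for illustration; a production system would rely on maintained reference data and stricter checks.

```python
import re

# Hypothetical lookup; a real system would use a maintained reference list.
COUNTRY_CODES = {"united states": "US", "usa": "US", "germany": "DE"}

# Loose shape check only: something@something.something, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_plausible_email(value: str) -> bool:
    """Validation: reject entries that cannot be an email address."""
    return bool(EMAIL_PATTERN.match(value.strip()))

def standardize_phone(value: str) -> str:
    """Standardization: keep digits only so formats compare equal."""
    return re.sub(r"\D", "", value)

def standardize_country(value: str):
    """Standardization: map a free-text country to a known code, or None."""
    return COUNTRY_CODES.get(value.strip().lower())
```

For example, `standardize_phone("234-567-8900")` and `standardize_phone("2345678900")` both yield the same digit string, which is what lets a later match rule compare the two values directly.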

Entity Resolution: Entity resolution follows a general workflow that subjects the validated and standardized data to a set of match rules, which determine how to proceed based on deterministic and probabilistic matching algorithms. Similar entries are treated according to their score. Entries whose scores signal tight similarity may be resolved automatically, while fuzzier ones may be sent to a data steward for resolution. In other cases, the cross-reference between entries may simply be recorded while the master record remains unchanged. Further entity resolution management activities include Master Data ID management (management of the Global IDs and cross-reference, or x-Ref, information) and Affiliation Management (understanding and establishing the relationships between Master Data entity records that correspond to the relationships the entities share in the real world).

At this point, Identification Management and Metadata Management systems will begin to manage the growing metadata and Globally Unique Identifiers that support access to the data now connected to newly discovered entities.
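The match-rule workflow in the Entity Resolution step can be sketched as a scoring function followed by a routing decision. Everything here is an assumption made for illustration: the score bands, the field names, the weights, and the mapping of the lowest band to "record the cross-reference only". The weighted string similarity is a simple stand-in for a probabilistic score, not a real probabilistic matching model.

```python
from difflib import SequenceMatcher

# Hypothetical score bands; real match rules would be tuned per data set.
AUTO_RESOLVE_AT = 0.90
STEWARD_REVIEW_AT = 0.70

def text_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a: dict, rec_b: dict) -> float:
    # Deterministic rule: an identical, non-empty email counts as a sure match.
    if rec_a.get("email") and rec_a.get("email") == rec_b.get("email"):
        return 1.0
    # Stand-in for a probabilistic score: weighted similarity of two fields.
    name_part = text_similarity(rec_a["name"], rec_b["name"])
    address_part = text_similarity(rec_a["address"], rec_b["address"])
    return 0.6 * name_part + 0.4 * address_part

def decide(score: float) -> str:
    """Route a candidate pair according to its score band."""
    if score >= AUTO_RESOLVE_AT:
        return "auto_resolve"
    if score >= STEWARD_REVIEW_AT:
        return "steward_review"
    return "record_cross_reference_only"
```

Calling `decide(match_score(a, b))` on two cleansed records gives one of the three treatments described above. Keeping the thresholds as named constants makes it easy to tighten or loosen the rules without touching the scoring logic.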

What are examples of entity resolution?

To illustrate, we use the following source data received by an MDM system. Imagine two data sets pulled together with very similar structures, but inconsistent entry data.

Source ID | Name         | Address                           | Telephone
549       | Jacob Smith  | 555 Main St., Freedonia, QT 87456 | (blank)
183       | J. Smith     | 555 Main St., Freedonia           | 2345678900
349       | Joanna Smith | 555 Main St., Freedonia           | 234-567-8900

Standardization appears to be missing across the three entries, but many similarities are present. The surnames are identical and the addresses are nearly the same, so there is cause to believe these entries are related. The abbreviated first name in entry 183 leaves questions, however, and the entities need to be resolved. This entry could represent the same entity as one of the other two, a third entity living at the same address, or simply be out of date. The telephone fields raise similar questions: entries 183 and 349 list the same number in different formats, while entry 549 lists none. If it’s learned that Jacob Smith’s telephone is different from Joanna Smith’s, then there is a better chance that entry 183 is Joanna Smith. But if entry 549’s telephone turns out to be identical to J. Smith’s, then more information may be needed to resolve the correct entity.
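The reasoning above can be sketched in code by collecting the evidence for each pair of entries. The helper names and the evidence keys are hypothetical, the name check assumes simple two-part names, and entry 549 is given an empty telephone because the source data lists none.

```python
import re

records = [
    {"id": 549, "name": "Jacob Smith",
     "address": "555 Main St., Freedonia, QT 87456", "phone": ""},
    {"id": 183, "name": "J. Smith",
     "address": "555 Main St., Freedonia", "phone": "2345678900"},
    {"id": 349, "name": "Joanna Smith",
     "address": "555 Main St., Freedonia", "phone": "234-567-8900"},
]

def digits(phone: str) -> str:
    """Strip formatting so only the digits are compared."""
    return re.sub(r"\D", "", phone)

def name_compatible(a: str, b: str) -> bool:
    """Surnames must agree; first names must agree or match as an initial."""
    first_a, last_a = a.replace(".", "").lower().split()
    first_b, last_b = b.replace(".", "").lower().split()
    if last_a != last_b:
        return False
    if len(first_a) == 1 or len(first_b) == 1:
        return first_a[0] == first_b[0]
    return first_a == first_b

def evidence(a: dict, b: dict) -> dict:
    """Summarize what the name and telephone fields say about one pair."""
    phone_a, phone_b = digits(a["phone"]), digits(b["phone"])
    return {
        "name_compatible": name_compatible(a["name"], b["name"]),
        "same_phone": bool(phone_a) and phone_a == phone_b,
        "phone_unknown": not phone_a or not phone_b,
    }

e_183_349 = evidence(records[1], records[2])
e_183_549 = evidence(records[1], records[0])
e_549_349 = evidence(records[0], records[2])
```

With this data, entries 183 and 349 have compatible names and the same telephone once formatting is removed. Entries 183 and 549 also have compatible names, but the missing telephone leaves that pair undecided, which is the ambiguity a data steward or additional data would need to settle.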

This simplified demonstration shows entity resolution at a very basic level. It is sometimes performed manually on small data sets using spreadsheets, but such techniques are wholly inadequate for organizations that leverage their big data as an operational asset. In these big data cases, entity resolution needs to be automated to be effective and efficient. Master Data Management platforms provide these automated entity resolution capabilities.
