Master Data Management, how to match and merge records to unify your data
By Julien Kervizic

Master Data Management (MDM) aims to provide a unified view of key business entities. In most organizations, it is particularly impactful around entities such as products and customers. To achieve clean results and data, records need to be associated and merged to form this unified view of the entities.

There are multiple challenges that data professionals face when dealing with the creation and maintenance of master data entities, from defining a matching strategy to setting up a consolidation strategy and a field authority strategy.

Matching Strategy

Matching Method

Matching can be accomplished in different ways, from automated rules that match data together to a manual review process.

The matching methods used can be set up to handle different matching certainties, with the most certain matches going through an automated process and the more uncertain ones going through a review process. This type of approach can be useful for getting the appropriate samples to train a prediction model.

Matching Types

A variety of matching types is possible: exact, phonetic, fuzzy, propensity, and negative matching.

Exact Match: An exact match relies on being able to link two different sources of information based on the existence of a specific key. This can be, for instance, an EAN code for products, a national identity number, or the concatenation of a name and an address.
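As a minimal sketch, an exact match is essentially a join on that key; here using pandas (the column names and values are illustrative):

```python
import pandas as pd

# Two product sources sharing an EAN code as the exact-match key.
products_a = pd.DataFrame({"ean": ["4006381333931"], "name": ["Pen"]})
products_b = pd.DataFrame({"ean": ["4006381333931"], "price": [1.99]})

# An inner join keeps only the records present in both sources.
matched = products_a.merge(products_b, on="ean", how="inner")
print(matched)  # one unified record with both name and price
```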

Phonetic Match: There can be cases where an exact match is not sufficient to handle the merging of information. This usually happens when data needs to be entered by hand based on vocal input, for instance, recording a name or address at a point of sale (POS) system. When this happens, we need to relax the matching algorithm to handle phonetically similar cases.

In Python, the phonetics library implements a few of the most popular phonetic matching algorithms. Generally, these algorithms work well for English but often do not support other languages.
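A minimal sketch, assuming the phonetics package exposes a soundex function that takes a single word (the exact code format may vary by implementation):

```python
import phonetics  # pip install phonetics

# Both spellings collapse to the same Soundex code, e.g. 'J450'.
print(phonetics.soundex("Julien"))
print(phonetics.soundex("Julian"))
```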

In the above example, we can see that Julien and Julian have the same phonetic representation under the Soundex algorithm, which explains why Starbucks baristas so often misspell my name as Julian.

Phonetic algorithms such as Soundex are also built into some databases, such as Postgres (via the fuzzystrmatch extension) or SQL Server.

Fuzzy match: A fuzzy match is a match that is not exact. Using fuzzy matching comes with a trade-off between exactness and coverage of the match. Applications include allowing for different spellings or input misspellings, or matching users when other criteria match and their addresses are within a certain radius of each other. The following tutorial from DataCamp shows how to compare different strings and generate a score enabling this inexact match.
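As a small illustration using only the standard library (the names compared are illustrative):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means an exact match."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(similarity("Julien Kervizic", "Julian Kervizic"))  # ~0.93
print(similarity("Julien Kervizic", "John Smith"))       # much lower
```

A threshold on the score then decides whether the pair counts as a match.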

Propensity match: Propensity matching provides a different type of inexact matching, this time relying on a prediction model to produce the likelihood that a match is valid.

Propensity matches provide a certain level of confidence that we should consider the match, with a threshold for assignment that can be configured.

Provided we can obtain some training data, we can leverage supervised learning methods such as logistic regression to generate this propensity score.

To compare two records containing text inputs for matches, we need to generate features from their raw, cleansed, or phonetic representations and compare the two sets of strings. We can also add additional features to the mix.
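The following is a minimal sketch of such a feature generator, assuming the textdistance package and a dataframe where each candidate pair sits on one row:

```python
import pandas as pd
import textdistance  # pip install textdistance

def hamming_features(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """For each base column name, compare the <col>_1 and <col>_2 variants
    and add normalized Hamming distances for the raw and lowercased forms."""
    for col in columns:
        raw, lower = [], []
        for a, b in zip(df[f"{col}_1"], df[f"{col}_2"]):
            raw.append(textdistance.hamming.normalized_distance(a, b))
            lower.append(
                textdistance.hamming.normalized_distance(a.lower(), b.lower())
            )
        df[f"{col}_hamming"] = raw
        df[f"{col}_hamming_lower"] = lower
    return df

pairs = pd.DataFrame({"name_1": ["Julien", "JOHN"], "name_2": ["Julian", "john"]})
print(hamming_features(pairs, ["name"]))
# name_hamming: 0.1667 and 1.0; name_hamming_lower: 0.1667 and 0.0
```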

The above creates features based on the Hamming edit distance. It looks up pairs of column values in a dataframe, assuming the same column names suffixed by _1 and _2, and compares their raw and lowercased representations using a normalized Hamming distance, as the sample output in the snippet's comments shows.

Based on that data, we can train a prediction model. This is easily done using sklearn:
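A sketch with purely illustrative features and labels:

```python
from sklearn.linear_model import LogisticRegression

# X: one row of distance features per candidate pair (as generated above);
# y: labels from a manual review process (1 = same entity, 0 = different).
X = [[0.17, 0.17], [0.95, 0.90], [0.10, 0.00], [0.88, 0.85]]
y = [1, 0, 1, 0]

model = LogisticRegression()
model.fit(X, y)
```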

The probability score can then be computed and compared to the threshold value.
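For instance, continuing the sketch above (the threshold value is an assumption to be tuned against review outcomes):

```python
# Probability that a new candidate pair is a valid match.
probability = model.predict_proba([[0.15, 0.12]])[:, 1]
is_match = probability >= 0.8  # auto-merge above the threshold
```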

Negative Match: Negative matching provides additional conditions or exclusion rules about when not to match two records together. They can be used to exclude an unlikely event.
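For instance, a minimal sketch that excludes a match when two candidate addresses are geographically too far apart (the threshold and field names are illustrative):

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

def negative_match(record_1, record_2, max_km=50):
    """True when the records should NOT be auto-merged."""
    distance = haversine_km(record_1["lat"], record_1["lon"],
                            record_2["lat"], record_2["lon"])
    return distance > max_km
```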

An example of when negative matches could be used is shown above: when comparing two sets of user records, we find that the distance between the two addresses is above a certain threshold. This type of negative match, when implemented within a workflow, could mean that an automated match is not applied, but the pair is still flagged for manual review.

Considerations

There are multiple considerations when looking at the merging of data: entities can have multiple names, language can be impactful in the case of names, places, or addresses, data should be normalized, and data cleansing is usually necessary before matching the different attributes.

Name frequency: When names are used for matching, one of the factors to take into consideration is their frequency. Different names have different frequencies of occurrence, and common names should require a higher burden of evidence before being matched together.
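A sketch of this idea, with a hypothetical frequency table (in practice this could be derived from census data):

```python
# Approximate share of the population carrying each full name (illustrative).
name_frequency = {"john smith": 1 / 1_000, "julien kervizic": 1 / 10_000_000}

def min_matching_fields(name: str) -> int:
    """Common names require more corroborating fields to accept a match."""
    if name_frequency.get(name.lower(), 0) > 1 / 100_000:
        return 3  # e.g. name + phone + address
    return 2      # e.g. name + one corroborating field
```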

The above example shows how unlikely my name is to be found compared to a John Smith. Given that it is so rare, it should require much less external evidence before being matched. If, for instance, we were to try to match records of a John Smith, we might need to add an additional phone number or address match to the mix.

Pseudonyms and Alternative Naming: It is worth noting that different entities can have alternative nomenclatures, be they official names, stage names, or aliases.

Take the example of the city of Astana, originally called Akmolinsk, later renamed Tselinograd, then Akmola, then Astana, and now called Nur-Sultan; or the city of Madras, nowadays named Chennai. Or historical figures such as Alexander the Great and Pliny the Younger, singers such as Eminem and 50 Cent, or actors such as Ben Kingsley, whose birth name is Krishna Pandit Bhanji.

Languages: Language tends to be an essential consideration when matching different entities. Take, for example, the matching of various sources in English and Chinese. The entity might be the same, but there are different representations of the same name across languages.

Take the Nur-Sultan case mentioned before: the same entity has different name representations, and just one of these representations, Nur-Sultan, is itself written differently in French, English, and Chinese.

Merging records across languages relies heavily on an appropriate multilingual data model, coupled with linking data across the languages.

Abbreviations: Abbreviations impact how different text records might need to be matched. From first-name shortenings, such as Bill for William or Pete for Peter, to title abbreviations such as Dr. for Doctor, they make it harder to find matches without additional processing.
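A small sketch of such processing, normalizing tokens through lookup tables (illustrative, not exhaustive):

```python
NICKNAMES = {"bill": "william", "pete": "peter"}
TITLES = {"dr.": "doctor"}

def normalize_token(token: str) -> str:
    """Expand known nicknames and title abbreviations."""
    token = token.lower()
    return NICKNAMES.get(token, TITLES.get(token, token))

print([normalize_token(t) for t in "Dr. Bill Smith".split()])
# ['doctor', 'william', 'smith']
```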

Data Cleansing & Data normalization: Trying to match differently text fields also requires some degree of data cleansing. From handling potential trimming of spaces, special characters, and punctuations to correcting misspellings, there are numerous steps to take to standardize the input and allow it to find an accurate match.

For example, phone numbers can be provided in quite different formats for the same true phone number. They can come with country code information, in + format (+44) or 00 format (0044), or without a country code; they can be provided as a pure numerical string (0601020304) or with separators (06-01-02-03-04). To be able to perform a full match, it is crucial to normalize the data into a single format.

Regular expressions (REGEXP) are a particularly useful tool in this respect, allowing strings and substrings to be matched based on different patterns.
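A minimal sketch for phone numbers (the 00-to-+ rewrite is an assumption about the desired canonical format):

```python
import re

def normalize_phone(raw: str) -> str:
    """Keep digit sequences (optionally preceded by a +) and join them."""
    number = "".join(re.findall(r"\+?\d+", raw))
    if number.startswith("00"):  # rewrite 0044 as +44
        number = "+" + number[2:]
    return number

print(normalize_phone("(+44) 06-01-02-03-04"))  # +440601020304
print(normalize_phone("0044 0601020304"))       # +440601020304
```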

In the example above, the regular expression matches sequences of digits, optionally preceded by a +. This provides a decent first step towards cleaning and normalizing phone numbers. There are better solutions for this particular use case than relying directly on regular expressions, but they prove useful as soon as you need to do something a little custom.

Authority Strategy

An authority strategy helps define which source should be considered the authoritative source of information for a given field when conflicting information is provided. There are different types of authority strategies, from authority hierarchies to time-based authority to voting rules.

Hierarchy

An authority hierarchy helps define a preference for taking certain types of information from certain sources. The logic behind this type of strategy is that some sources of information should be considered “trusted sources,” while others provide information of varying quality.

Let’s look at the information contained in a government database of national identities and names. We can consider this a highly trusted source of information. Compare it to an input form on a website, where typos or bad input can occur for both the identity and the name.

Such a hierarchy strategy works by selecting each profile attribute of the consolidated record from the most preferred source that provides it.

Having an authority hierarchy of information helps in this case to prioritize the information coming from the more trusted sources. In the example described above, we would consider the government database the more authoritative source of information and use its information wherever provided. Where the information from this source is not available, we would rely on the input form, but only as a complement to the authoritative source.
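A minimal sketch of this logic (the source names and fields are hypothetical):

```python
# Sources ordered from most to least trusted.
SOURCE_PRIORITY = ["government_db", "crm", "input_form"]

def consolidate(records: dict, fields: list) -> dict:
    """Pick each field from the most trusted source that provides it."""
    consolidated = {}
    for field in fields:
        for source in SOURCE_PRIORITY:
            value = records.get(source, {}).get(field)
            if value is not None:
                consolidated[field] = value
                break
    return consolidated

records = {
    "government_db": {"name": "Julien Kervizic", "national_id": "123"},
    "input_form": {"name": "Julian Kervizic", "email": "julien@example.com"},
}
print(consolidate(records, ["name", "national_id", "email"]))
# {'name': 'Julien Kervizic', 'national_id': '123',
#  'email': 'julien@example.com'}
```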

Time-bound

Different fields of information can be “mutable,” i.e., liable to change, and it is important to decide how to treat information across time.

Fields such as names are relatively immutable. They can technically change, for instance when getting married or requesting a name change with the government, but in the majority of cases they don't change frequently.

Other fields, such as an address or a phone number, have a higher tendency to be mutable: people change addresses and phone numbers quite easily. Fields with a higher mutability tendency benefit more from applying a time-based authority strategy, since you want the freshest source of information.
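A minimal sketch: for a mutable field, take the value with the most recent update timestamp (the values below are illustrative):

```python
from datetime import datetime

def freshest(values: list):
    """values: (value, last_updated) tuples; return the most recent value."""
    value, _ = max(values, key=lambda pair: pair[1])
    return value

emails = [
    ("julien@old-provider.com", datetime(2017, 3, 1)),
    ("julien@new-provider.com", datetime(2019, 6, 15)),
]
print(freshest(emails))  # julien@new-provider.com
```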

Extending the previous example with a time-bound authority strategy for the email and mobile fields would change which attributes are selected in the consolidated profile: for those two fields, the freshest value wins.

Voting rule

When dealing with multiple sources of information for the same fields, it can be beneficial to apply a voting rule authority strategy. The authoritative field would be defined through a majority vote across the different sources of information.

A voting rule authority strategy can provide a strong signal when no trusted source of information exists but a high number of available sources can be correlated.

Extending the example discussed above with a voting strategy for the address field, besides more misspellings of my name appearing, the address provided by the government data source would no longer be selected for the unified profile. This is due to more than one data source providing an alternate address, as the sketch below illustrates.
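A minimal sketch of a voting rule (the source names and addresses are hypothetical):

```python
from collections import Counter

def vote(values):
    """Majority vote across sources; ties are ignored in this sketch."""
    counts = Counter(v for v in values if v)
    winner, _ = counts.most_common(1)[0]
    return winner

addresses = {
    "government_db": "Main Street 1",
    "crm": "Rodeo Drive 5",
    "input_form": "Rodeo Drive 5",
}
print(vote(addresses.values()))  # 'Rodeo Drive 5' outvotes the trusted source
```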

Considerations

The type of authority strategy applied should be very much field-dependent.

Multilevel strategy: Depending on the number of data sources available, it might be useful to group the data sources into classes and apply a multilevel authority strategy across these levels.

Time-Dependent: No matter what strategy you use, time will always be a factor to some extent, and it is important to place a time boundary on most fields when applying a strategy. In the voting strategy example, it wouldn't have been very smart to apply the vote if all the records voting for “Rodeo Drive” predated the authoritative record (in terms of source hierarchy). Likewise, if the last update from an “authoritative” source is a few years old, it might be worth reconsidering whether to use a “fresher” source of information.

Consolidation Strategy

The consolidation strategy dictates how the records are merged. There are generally two types of consolidation strategies: hard merge and soft merge. A hard merge consolidates multiple records into one single record, while a soft merge creates an association between the different records so that, when read, they are grouped.

Hard Merge

Within a hard merge, there are two subtypes of consolidation strategy that can be applied: records can be hard merged using a forward-only merging strategy or a backward merging strategy. Both subtypes result in a consolidated dataset.

As an example, consider how a hard merge would look based on a mixed hierarchy/time-bound authority strategy.

There are benefits to performing a hard merge, such as a reduction in database size, more efficient queries, and ease of extraction of the authoritative fields. There are, however, a few drawbacks, the three main ones being 1) data loss, 2) irreversible merging, and 3) not all authority strategies playing well with hard merging.

The forward-only merging strategy ties the records together after a matching condition has been met. It does not impact the historical records.


Let’s take an example where we have two profiles coming from different sources.

  • We are receiving a stream of events from each of these sources associated with the profiles.
  • At some point, some information is added to the first profile that allows its identity to be matched to Profile 2.
  • Profile 2 is hard merged onto Profile 1 at this point; the historical data is still leveraged from Profile 1, but not from Profile 2.
  • As new events come in, they are directly associated with Profile 1.

One method to implement the forward merging of associated data is by maintaining a redirection table. For the example mentioned above, this can be done in the following way in Python:
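A minimal in-memory sketch; the Profile class and event names are illustrative:

```python
class Profile:
    """A bare-bones profile holding an id and a list of events."""
    def __init__(self, profile_id):
        self.id = profile_id
        self.events = []

profile_1, profile_2 = Profile("profile_1"), Profile("profile_2")

# The redirection table maps an incoming id to its canonical profile.
redirection = {"profile_1": profile_1, "profile_2": profile_2}

def push_event(profile_id, event):
    """Route the event to the canonical profile via the redirection table."""
    redirection[profile_id].events.append(event)

push_event("profile_1", "page_view_a")
push_event("profile_2", "page_view_b")
```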

At some point, we merge the profiles together and update the redirection table, i.e., both ids now refer to the same first profile:
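```python
# Forward-only merge: from now on, profile_2's events land on profile_1;
# profile_2's historical events stay where they are.
redirection["profile_2"] = profile_1
```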

Pushing additional events after the forward merging of identities:
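```python
push_event("profile_1", "purchase_a")
push_event("profile_2", "purchase_b")  # redirected onto profile_1
```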

This gives us the following events in each profile after the forward-only merge has happened:
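```python
print(profile_1.events)  # ['page_view_a', 'purchase_a', 'purchase_b']
print(profile_2.events)  # ['page_view_b']
```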

Besides the id redirection and the merging of attributes, merging the profiles might incorporate further steps, such as deactivating the previous profile.

The backward merging strategy ties records back historically, merging together the different records that have been identified by the matching conditions.

In the case of backward merging, the data related to both profiles is merged onto one profile, both historically and as time goes on. In code, this type of strategy can be implemented as an initial merge of the events related to both profiles plus an id redirection:
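Continuing the sketch, starting again from the pre-merge state:

```python
# Reset to the pre-merge state of the previous example.
profile_1, profile_2 = Profile("profile_1"), Profile("profile_2")
redirection = {"profile_1": profile_1, "profile_2": profile_2}
push_event("profile_1", "page_view_a")
push_event("profile_2", "page_view_b")

# Backward merge: move the historical events, then redirect the id.
profile_1.events.extend(profile_2.events)
profile_2.events = []
redirection["profile_2"] = profile_1

push_event("profile_1", "purchase_a")
push_event("profile_2", "purchase_b")  # now lands on profile_1
```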

This gives us the following events in each profile after the backward merge has happened:
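```python
print(profile_1.events)
# ['page_view_a', 'page_view_b', 'purchase_a', 'purchase_b']
print(profile_2.events)  # []
```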

Soft Merge (Association)

Soft merging strategies rely on an association being created between the different records, allowing the different records to be grouped. As with hard merges, two main strategy subtypes can be applied: a full association or an association with filtering conditions.

One of the main advantages of using a soft-merge strategy is that the association can always be undone. It does, however, have some space and performance disadvantages: all the records need to be maintained, and queries need to look up the associated records and then apply the authority strategy to them.

Full Association

A full association provides an association record for the different entities provided. Think of a table [id1, id2] that allows any record to be associated with another one; this table allows any record to fetch information from any associated user record, and an authority strategy can be applied to the different values on it.

An application of this is, for instance, the ability to match an anonymous session to a logged-in user identity. The identity key provided before login is a specific cookie value, for instance a Google Analytics client id, while once logged in, a new identity is provided that needs to be associated with it.

The full association allows ongoing activities to be extracted across the different identities, but also allows each identity to leverage the historical data through it.

To make an association work, we need to be able to tie a given profile to one or more other profiles. Taking the previous code example, we can make a few alterations to the Profile class to support this.

We need to incorporate a way to 1) store associations, 2) add associations, and 3) query the relevant information obtained by association (in this case, events):
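A sketch of these alterations on the same illustrative Profile class:

```python
class Profile:
    """Profile with soft-merge support through associations."""
    def __init__(self, profile_id):
        self.id = profile_id
        self.events = []
        self.associations = []  # 1) store associations

    def push_event(self, event):
        self.events.append(event)

    def associate(self, other):
        """2) add a two-way association between profiles."""
        self.associations.append(other)
        other.associations.append(self)

    def all_events(self):
        """3) query events, including those of associated profiles."""
        combined = list(self.events)
        for other in self.associations:
            combined.extend(other.events)
        return combined
```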

Following the same pattern as in the previous example and performing the association, we can see how the full set of events is now captured post-association:
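```python
profile_1, profile_2 = Profile("profile_1"), Profile("profile_2")
profile_1.push_event("page_view_a")
profile_2.push_event("page_view_b")

profile_1.associate(profile_2)
profile_1.push_event("purchase_a")
profile_2.push_event("purchase_b")

print(profile_1.all_events())
# ['page_view_a', 'purchase_a', 'page_view_b', 'purchase_b']
```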

But contrary to the profile merging approach, it is still possible to retrieve the events directly associated with each profile:
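```python
print(profile_1.events)  # ['page_view_a', 'purchase_a']
print(profile_2.events)  # ['page_view_b', 'purchase_b']
```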

Association with filtering

An association with filtering provides a little more control over how the different data points are consolidated. It allows us to implement a forward-only kind of merge as an association, which is useful when, for instance, you are only able to leverage the joint record if a user has accepted new terms of service.

Implementing association with filtering is possible but requires some additional code changes to exclude the unwanted records:
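One way to sketch this is to timestamp events with a sequence counter and only share events recorded after the association was created (this mechanism is an assumption for illustration):

```python
import itertools

clock = itertools.count()  # global event sequence, for illustration

class Profile:
    """Profile whose associations only share events from a given point on."""
    def __init__(self, profile_id):
        self.id = profile_id
        self.events = []        # (sequence, event) pairs
        self.associations = []  # (profile, from_sequence) pairs

    def push_event(self, event):
        self.events.append((next(clock), event))

    def associate(self, other):
        """Share only events recorded from this point onward."""
        now = next(clock)
        self.associations.append((other, now))
        other.associations.append((self, now))

    def all_events(self):
        combined = [event for _, event in self.events]
        for other, from_seq in self.associations:
            combined.extend(e for seq, e in other.events if seq >= from_seq)
        return combined
```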

This yields the following results when ingesting the different events:
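```python
profile_1, profile_2 = Profile("profile_1"), Profile("profile_2")
profile_1.push_event("page_view_a")  # before the association: not shared
profile_2.push_event("page_view_b")

profile_1.associate(profile_2)
profile_1.push_event("purchase_a")
profile_2.push_event("purchase_b")

print(profile_1.all_events())  # ['page_view_a', 'purchase_a', 'purchase_b']
print(profile_2.all_events())  # ['page_view_b', 'purchase_b', 'purchase_a']
```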

Unlike with the forward merging strategy, it is possible to take the viewpoint of both Profile 1 and Profile 2.

Considerations

There are multiple things to consider when deciding which consolidation strategy to use: the volume of data, performance, complexity, and the need to un-match records.

Volume of data: The overall data volume is a factor in the decision of which strategy to apply. Consider transient identifiers such as website session ids, each of which can create a temporary profile. A given user can have hundreds of temporary profiles and potentially one logged-in profile. To fetch the authoritative profile, the application would have to go through all these temporary profiles and apply the authority strategy. Another example where the volume of data can be impactful is when relying on user input. Think of a website letting its users input their favorite holiday destination. This could result in thousands of duplicate records for each destination due to spelling errors, different names… Keeping only a soft merge strategy would be quite tricky.

Uniqueness of matches: The matching strategy may have to deal with non-unique matches. In the case of a soft-merge strategy, only an association key needs to be added to deal with that information properly, while in the case of a hard merge, this may lead to information being duplicated across key records. Additional matching rules may be added, such as only merging with the initial record found. Still, the uniqueness of matches is a consideration to keep in mind when setting up the matching and merging strategies.

Performance and complexity: Using a soft-merge strategy typically offers lower read performance than a hard merge strategy, and as we saw in our simple code examples above, implementing a soft merge strategy usually requires some additional complexity.

Uncertain matches: Different types of matching strategies and identifiers lead to different risks of matching records that do not belong together. When dealing with strategies that lead to risky matches, it is often better to apply a soft merge: it makes it easier to accept more uncertain matches being consolidated together, as the association can always be undone.

Regulation: Regulation sometimes plays a role in how records should be merged. It can, for instance, dictate what data should be available for use in a consolidated profile or what data can be used for which purpose. Association with filtering is the consolidation strategy that most easily satisfies different regulations, but it is also the most complex to integrate.

Summary

There are multiple steps and considerations when handling master data management; each of the three strategies applied when consolidating records (matching, authority, and consolidation) has numerous facets.

Often, a granular setup is needed, not only at the entity level but at the field level, to apply the right strategy to the right piece of information. The question is then more about balancing the different pros and cons of the various (sub)strategies that need to be applied.

It is often a safe bet to start from a risk-aware approach to record consolidation: use a soft-merge strategy with review to ensure the best quality, and handle the more advanced aspects and more automated consolidation merging at a later stage.