Deduplication

Deduplication is the process of identifying and removing double records in a database. It is done in two steps:

  1. Matching
    Records that are (potential) duplicates are stored in a duplicate group. Identifying these records is done by using a matching profile. A matching profile defines the rules for identifying duplicate records.

  2. Merging
    The records in a duplicate group are merged into one final record. This can be done manually or automatically. Automatic merging uses a merging profile. A merging profile defines the rules used for merging duplicate records.

Two methods of deduplication

There are two methods to create deduplicate groups.

  1. A record can be in one single duplicate group

    Any record can only be part of one duplicate group. A duplicate groups can contain any number of records (at least two). The match score of the group is the lowest value among all the pairs of records within the group that have a match score above the threshold. It is possible for two records A & B to be in the same duplicate group and yet have a match score below the threshold, as long as there are other records in the same group that do match with both A & B. For instance: if A & B don't match, but A & C match, and B & C match as well, a group will be created containing all three records.

  2. A record can be in multiple groups
    Records can be in any number of duplicate groups simultaneously: one for each other record that matches with it. All duplicate groups consist of exactly two records. The match score of a duplicate group is always equal to the match score of the two records within it. Records that do not match sufficiently will never be in a duplicate group together. For instance: if A & B don't match, but A & C match, and B & C match as well, two groups will be created: one containing A & C, and one containing B & C.

Method A is the default and preferred method. Method B is used in specific situations. Please contact CrossmarX if you need more information about which method to use.

Preparing the blueprint

To be able to start deduplication, a "deduplication group" field has to be added to the class to deduplicate. This field should have a database column name, the data type should be empty or "whole number", and the field role should be set to "deduplication: duplicate group". The field role can be set under the "properties" tab of the field, in the "properties (extra fields)"-section.

Matching

The duplicate matching tool will search through the records of a class for potential duplicates. When comparing two records, a matching score is calculated. This score is calculated using a matching profile, which should be created first. To create this profile, go to Blueprint -> Deduplication -> Matching profiles in the application, and click on "New matching profile". The class to deduplicate should be set here, as well as the fields to be used in the matching process. The total score is the sum of the matching scores per field. If "Points subtracted for non match" has a value and the field does not match this value will be subtracted from the total score. If the score is higher than the value set at "Treshold for potential match", the records are grouped as duplicates. These duplicate groups can contain more than two records.  Most fields that can be set in the matching profile have tooltips to explain them in more detail. See example below.

 

 

The matching profile has a "Match" button to start the matching process. When the matching process is done, all duplicate groups can be found under Blueprint -> Deduplication -> Duplicate groups.

Note on performance:
The duplicate matcher might be matching large amounts of records. To improve performance, at least one matching profile item must have the field "Identical" toggled on.This toggle discards all the non matching records very efficiently.

Merging

First of all, records in duplicate groups can be merged group by group. Click on one of the duplicate groups and select a merging profile. If you want to merge all records select all checkboxes left of the records or select none. You can also make a selection. Now click the "merge" button underneath the rows, and the records will be merged into one. It is also possible to manually select the fields that should be used when merging, by clicking the "merge manually" button after selecting the rows that should be merged.

Of course, in many cases it is preferred to automate the process. The duplicate merging tool can be used for this. It will merge the records of duplicate groups to one record, using a merging profile. The merging profile defines the rules for merging. To create a merging profile, go to Blueprint -> Deduplication -> Merging profiles in your application, and click on the "New merging profile" button.

The following merge rules are availabe. Each rule can be choosen to have a priority from First to Fourth.

  • keep non empty values
  • Merges records field by field. Non empty values are saved. Does not merges the existing value of the target record with an empty value from the source record. After choosing ‘keep non empty values’ the tab ‘Merging profile field items’ appears. If you would like to keep the values of the source records in the target record, select the ‘keep all values’ checkbox. The field will automatically become a ‘can have more values’ field
  • Use source priorities
  • Merges records based on the priority of the fields. Records are sorted ascending based on the values of the priority field. Create a class ‘source’ with 2 fields. ‘Name’ with datatype text and ‘Priority’ with datatype whole number. Set the ‘Priority’ as field role Deduplication source priority. Add the class ‘source’ as a foreign class tot the class your merging on. Set the ‘Source’ as field role ‘Deduplication source’. Now the values of Priority are used for merging
  • Keep most recent creation date
  • Merges records as a whole. The record with the most recent creation date is kept. To actually see the creation date of each record you can add a field that shows the creation date. Go to the class properties – columns in table – add columns. Here you can select the creation date
  • Keep earliest creation date
  • Merges records as a whole. The record with the earliest creation date is kept. To actually see the creation date of each record you can add a field that shows the creation date. Go to the class properties – columns in table – add columns. Here you can select the creation date.
  • Keep most recent modification date
  • Merges records as a whole. The record with the most recent modification date is kept. To actually see the modification date of each record you can add a field that shows the modification date. Go to the class properties – columns in table – add columns. Here you can select the modification date.
  • Keep earliest modification date
  • Merges records as a whole. The record with the earliest modification date is kept. To actually see the modification date of each record you can add a field that shows the modification date. Go to the class properties – columns in table – add columns. Here you can select the modification date.
  • Keep highest id
  • Merges records as a whole. The record with the highest id is kept. This rule works for classes with a autonumber or number key field.
  • Keep lowest id
  • Merges records as a whole. The record with the lowest id is kept. This rule works for classes with a autonumber or number key field.

The record that results from merging the duplicate records can be a combination of values of several duplicate group records, or one of the duplicate group records as a whole. If no explicit rules are defined for merging connected records, all connected records of the merged records are attached to the resulting record. But it’s possible to also deduplicate records of a connected class bij adding a merging profile table item. Click add Merging table item, select a connected class and a type field, this field will be used to deduplicate on. See example below.

 

 

From the merging profile details screen, click the "merge" button to go to the merging screen. Here the records can be merged by clicking the "execute" button. Additional information on parameters can be found in tool tips. For example you can make a selection of the duplicate groups to merge by entering a 'from group id' and a 'to group id'.