Automatic deduplication job
Potential duplicates are identified in two steps:
- Candidate location: potential duplicates are located in the database using specific search criteria.
 - Duplicate validation: potential duplicates are checked against a set of rules (different for every content type).
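
The two steps can be pictured as a small filter pipeline. The sketch below is only an illustration, not the job's actual implementation: records are assumed to be plain dictionaries, and `find_duplicates`, `same_doi` and `same_year` are hypothetical names. It follows the rule stated in the overview table that one search criterion has to be fulfilled to locate a candidate, while all validation rules have to be fulfilled to confirm it.

```python
from typing import Callable, Iterable

Record = dict  # assumption: a record is a plain dict of field values
Rule = Callable[[Record, Record], bool]


def find_duplicates(
    target: Record,
    database: Iterable[Record],
    search_criteria: list[Rule],
    validation_rules: list[Rule],
) -> list[Record]:
    """Locate candidates for `target`, then keep only the validated ones."""
    # Step 1, candidate location: ANY search criterion may match.
    candidates = [
        rec for rec in database
        if rec is not target and any(crit(target, rec) for crit in search_criteria)
    ]
    # Step 2, duplicate validation: ALL rules must match.
    return [
        rec for rec in candidates
        if all(rule(target, rec) for rule in validation_rules)
    ]


# Hypothetical criteria, for illustration only.
def same_doi(a: Record, b: Record) -> bool:
    return a.get("doi") is not None and a.get("doi") == b.get("doi")


def same_year(a: Record, b: Record) -> bool:
    return a.get("year") == b.get("year")


target = {"id": 1, "doi": "10.1000/x", "year": 2020}
database = [
    target,
    {"id": 2, "doi": "10.1000/x", "year": 2020},  # located and validated
    {"id": 3, "doi": "10.1000/x", "year": 2019},  # located, fails validation
    {"id": 4, "doi": "10.1000/y", "year": 2020},  # never located
]
duplicates = find_duplicates(target, database, [same_doi], [same_year])
print([rec["id"] for rec in duplicates])
```

Keeping the two steps separate means the cheaper search can narrow the database first, so the stricter per-content-type rules only run on a short candidate list.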
 
By default, a field is handled in the following manner:
Single/Multi-valued Fields:
- Merge Action: ADD
- If a duplicate of the target publication holds a value for a field that the target publication does not contain, the field value from the duplicate will be used.

- Merge Action: REPLACE
- If a duplicate of the target publication holds a value for a field that the target publication also contains, the field value from the duplicate will be used.
 
 
Multi-valued Fields:
- Merge Action: MERGE
- Merges values from a multi-valued field on a duplicate to the multi-valued field on the target
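
The three merge actions can be sketched as a single function. This is a minimal illustration under stated assumptions, not the product's code: `merge_field` and `_is_empty` are hypothetical names, "no value" is assumed to mean `None`, an empty string or an empty list, REPLACE is assumed to leave the target untouched when the duplicate has no value, and MERGE is assumed to skip values the target already holds.

```python
def _is_empty(value) -> bool:
    """Assumption: None, empty strings and empty lists count as 'no value'."""
    return value is None or value == "" or value == []


def merge_field(action: str, target_value, duplicate_value):
    """Return the target's field value after applying one merge action."""
    if action == "ADD":
        # Use the duplicate's value only when the target has none.
        return duplicate_value if _is_empty(target_value) else target_value
    if action == "REPLACE":
        # The duplicate's value wins whenever the duplicate has one.
        return target_value if _is_empty(duplicate_value) else duplicate_value
    if action == "MERGE":
        # Multi-valued fields: append the duplicate's values to the target's,
        # skipping values that are already present (an assumption).
        merged = list(target_value or [])
        for value in duplicate_value or []:
            if value not in merged:
                merged.append(value)
        return merged
    raise ValueError(f"unknown merge action: {action}")


print(merge_field("ADD", None, "10.1000/x"))
print(merge_field("REPLACE", "Old title", "New title"))
print(merge_field("MERGE", ["a", "b"], ["b", "c"]))
```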
 
 
Overview of automated deduplication job
The table below specifies the criteria for deduplication for all supported content types.
Search criteria: one of the elements has to be fulfilled.
| Content type | Location: search criteria (one has to be fulfilled) | Validation: found candidates must match on (all have to be fulfilled) |
|---|---|---|
| Research output* | | source/source ID combination*; Title (80% match); Subtitle (80% match) |
| | | DOI; Title (80% match); Subtitle (80% match) |
| | | Title (80% match); Subtitle (80% match); Number of pages; Persons (by name); Year; If found: |
| | | One of the lines has to be fulfilled. |
| Journal | | at least one title |
| Publisher | | name; country |
| Event | | title; if found, then also: city and country; if found, then also: period |
| External Organisation | | name; type (ignored if one is "unknown"); country (ignored if both are null); subdivision (ignored if both are null); city (ignored if both are null); state (ignored if both are null) |
| Person* | | at least one first name; one last name; if found, then also: Scopus ID (must not contradict); ORCID iD (must not contradict) |
| External Person | | name; possible to configure: match all External organisations |
| Activity* | | Title (90% match); if the title is generically generated: Description (90% match); Visibility; Type/template; Period; can be configured: Persons and organisations; if of a type where one Event, Publisher etc. can be chosen, there needs to be a match on those. See here for more details: Deduplication of Activities |
| Prize* | | Title (90% match); Visibility; Type/template; Date; Persons; Organisations. See here for more details: Deduplication of Prizes |
| Application* | | Title (90% match); Visibility; Type/template; Period; Persons; Organisations. See here for more details: |
| Award* | | Title (90% match); Visibility; Type/template; Period; Persons; Organisations. See here for more details: Deduplication of Awards |
| Course* | | Title (90% match); Visibility; Type/template; Period; Persons; Organisations. See here for more details: Deduplication of Courses |
| DataSet* | | Type/template; DOI; if DOI not found: Title (90% match); Description; Visibility; Person; Organisation. See here for more details: Deduplication of DataSets |
| Press/Media* | | Title (90% match); Visibility; Type/template; Period; Persons; Organisation. See here for more details: Deduplication of Press/Media (clipping) |
| Projects | | Title (90% match); Visibility; Type/template; Period; Persons; Organisation. See here for more details: Deduplication of Projects |
*Research output: Title and subtitle with a high similarity score.
*Person: The more recent person, based on employment dates and whether each person has active employments, is used as the target of the merge.
*Activity, Prize, Application, Award, Course, DataSet, Press/Media, Projects: The jobs are only available on request.
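
The rule shapes that recur in the table (a percentage title match, "must not contradict", "ignored if both are null") can be sketched as small predicates. This is only an illustration: the source does not say how the similarity score is computed, so Python's `difflib.SequenceMatcher` stands in for it as an assumption, and all function names are hypothetical. The sketch also assumes that a field present on only one side counts as a mismatch under the "ignored if both are null" rule.

```python
from difflib import SequenceMatcher


def title_matches(a: str, b: str, threshold: float) -> bool:
    """True if the two titles are at least `threshold` similar (0.0 to 1.0).

    Assumption: case-insensitive character similarity via SequenceMatcher.
    """
    ratio = SequenceMatcher(None, a.casefold(), b.casefold()).ratio()
    return ratio >= threshold


def must_not_contradict(a, b) -> bool:
    """Passes unless both sides hold a value and the values differ."""
    return a is None or b is None or a == b


def equal_unless_both_null(a, b) -> bool:
    """The field is ignored when both sides are null, compared otherwise."""
    if a is None and b is None:
        return True
    return a == b


# Research output uses an 80% threshold; Activity, Prize etc. use 90%.
print(title_matches("Deep learning for images", "deep learning for image", 0.80))
print(must_not_contradict("0000-0002-1825-0097", None))
print(equal_unless_both_null("DK", None))
```

A "must not contradict" rule is deliberately weaker than an equality rule: a missing identifier on either side does not block the match, whereas two different identifiers do.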
Detailed description
Research Output
Search Criteria
In order to determine if a publication is a duplicate, we use three strategies.
Validation criteria
Configuration: