Published on October 27th, 2025. Last updated 7 days ago.

Deduplication of Activities

Activity

This job is intended for use with the Community Module.

Duplicate Resolver:

The menu "duplicate titles" is available.
"Manage duplicates" is available in the editor.

If two records are marked as duplicates, see below for the merge strategy.

Automated Deduplication job:

Criteria decorators:

The job finds candidates through a DB query. If any one of the following criteria is fulfilled, the pair is considered a potential duplicate.

  • Titles are at least 90 pct similar and the year is the same
    • Search through all existing Activities
    • If there is an Activity where both of these are fulfilled, it is considered a potential duplicate
      • Titles:
        • Titles are 80 pct similar
      • Date:
        • If the specific date is present, its year is used
        • If not, but the start date of the period is present, its year is used
  • OR
  • Classified id
    • Search through all existing Activities
    • If there is an Activity with a classified id (Source) of the same type and an identical value, it is considered a potential duplicate
  • OR
  • All persons and dates are identical, and titles are similar
    • Search through all existing Activities
    • If there is an Activity where all of these are fulfilled, it is considered a potential duplicate
      • Date:
        • The date has to be identical
          • Only year
      • Persons:
        • The same persons are assigned to both activities
      • Titles:
        • Titles are 60 pct similar
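The three candidate criteria above can be sketched as a single predicate. This is only an illustration under stated assumptions: the real job runs a DB query, whereas the sketch uses Python's `difflib.SequenceMatcher` as a stand-in similarity measure, and the `Activity` class, its field names and `is_candidate` are hypothetical. Because the text above lists both 90 and 80 pct for the first criterion, that threshold is left as a parameter.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass(frozen=True)
class Activity:
    """Hypothetical, simplified view of an activity record."""
    title: str
    # Year of the specific date, else year of the period start date.
    year: Optional[int] = None
    persons: frozenset = frozenset()         # names of assigned persons
    classified_ids: frozenset = frozenset()  # (type, value) pairs


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; not the measure used by the DB query."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_candidate(a: Activity, b: Activity,
                 title_year_threshold: float = 0.80,
                 persons_threshold: float = 0.60) -> bool:
    """Return True if any one of the three criteria is fulfilled."""
    same_year = a.year is not None and a.year == b.year
    sim = similarity(a.title, b.title)

    # Criterion 1: similar titles and the same year.
    title_and_year = same_year and sim >= title_year_threshold
    # Criterion 2: a classified id of the same type with an identical value.
    shared_classified_id = bool(a.classified_ids & b.classified_ids)
    # Criterion 3: same year, same persons, and loosely similar titles.
    persons_and_date = (same_year and a.persons == b.persons
                        and sim >= persons_threshold)

    return title_and_year or shared_classified_id or persons_and_date
```

Since the criteria are joined with OR, a shared classified id alone is enough to make two activities candidates, even when their titles and years differ. Being a candidate does not imply a merge; that is decided by the match strategy.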

Duplicate match strategy:

This is a programmatic match. All criteria must be met before proceeding to merge.

  • The template is the same.
  • AND
  • If both contents have a classified id of the same type with the same value
    • If titles have an equality score (Levenshtein distance) above 80 pct
      • Contents will be merged
  • If not, all of the remaining criteria have to be fulfilled:
    • Visibility is the same
    • AND
    • Titles are at least 90 pct similar
      • Cleaned for tags
      • Made lower case
      • Note:
        • If the title is generically generated (from Event, Journal, etc.), the description is used as the title
          • This is, for example, to prevent merging two talks from the same person at the same event
    • AND
    • All organisations are present for both duplicate and content
      • First checked by name
      • If not fulfilled, checked by classified ID
        • All organisations need to have a classified ID
    • AND
    • All persons are present for both duplicate and content (matched by name). Note: persons on source and target are required to have the same roles.
      • Persons need to have the same name (also if stated as internal on one activity and external on the other) before we merge.
        • Different name variants are also checked
    • AND
    • Type (Type Classification) is the same
    • AND
    • Start dates are the same (the specific date or the start date of the period is the same)
      • If the month is present for both, months are also compared; otherwise the month is ignored
    • AND
    • Activity category is the same (if present; otherwise ignored)
    • AND
    • If the activities are:
      • Of a type with one of these: Event, External Organisation, Organisation, Publisher or Journal.
        • That has to be the same on both activities
          • Example: if there is an Event assigned, both activities need to have the same Event assigned
      • Of a type with a host or a visitor
        • That has to be the same on both activities
  • Note: It is possible to toggle off the requirement that organisations and persons are identical for a merge.
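The title comparison in the match strategy (tags removed, lower-cased, then scored with Levenshtein distance) could look roughly like the sketch below. The function names, the tag-stripping regular expression and the normalisation of the distance by the longer title's length are assumptions made for illustration; the article does not state how the distance is turned into a percentage.

```python
import re

# Assumption: "tags" means markup such as <i>...</i> embedded in the title.
TAG_RE = re.compile(r"<[^>]+>")


def clean_title(title: str) -> str:
    """Remove markup tags and lower-case the title."""
    return TAG_RE.sub("", title).lower().strip()


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def title_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical cleaned titles."""
    a, b = clean_title(a), clean_title(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def titles_match(a: str, b: str, threshold: float = 0.90) -> bool:
    """True if the cleaned titles reach the given similarity threshold."""
    return title_similarity(a, b) >= threshold
```

With this normalisation, "Open Science Day 2024" and "Open Science Day 2025" differ by one character out of 21 and so score about 95 pct, which passes the 90 pct rule. This shows why the remaining criteria (persons, dates, type and so on) matter for telling such records apart.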

Merge converter:

For the purpose of the merge, a target is chosen from the duplicate candidates. Unless stated otherwise, the target value will appear on the merged version.

The target is the activity that has existed the longest.
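Selecting the target can be pictured as taking the candidate with the earliest creation time, with every other candidate becoming a source to be merged into it. The `Candidate` class and its `created` field are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass
class Candidate:
    """Hypothetical duplicate candidate; only the fields needed here."""
    uuid: str
    created: datetime


def choose_target(candidates: List[Candidate]) -> Tuple[Candidate, List[Candidate]]:
    """Return (target, sources): the oldest candidate and all the others."""
    target = min(candidates, key=lambda c: c.created)
    sources = [c for c in candidates if c is not target]
    return target, sources
```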

  • Sources:
    • A predefined function in “Abstract merge converter”
      • Merge source and source ID
        • If target source id or target source is empty
          • The source of the sourceContent is set as the target source
          • The sourceId of the sourceContent is set as the target sourceId
          • The external status of the sourceContent is set as the target external status
        • If the target source and sourceContent source, and the target sourceId and sourceContent sourceId, are not equal
          • If the target has no secondary source that matches the sourceContent source
            • The sourceContent source is set as a secondary source for target
      • Merge secondary sources
        • If sourceContent has secondary source
          • For each of the secondary sources in sourceContent
            • If each of the following is fulfilled
              • Secondary source does not match a source in target
              • Secondary source Id does not match a sourceId in target
              • If there is no secondary source in target that matches the secondary source
                • The secondary source is set as target secondary source
      • Merge source data
        • If the target source data does not contain the sourceContent source data key
          • The sourceContent DataEntry is set as the target source data
  • Ids:
    • A predefined function in “Abstract merge converter”
      • For sources in sourceContent
        • If there is no matching classified source in target
          • The classified source is cloned and added to target sources
  • PreviousUuids:
    • A predefined function in “Abstract merge converter”
      • The sourceContent uuid is set as previousUuid for target
      • If sourceContent has previousUuid
        • These are also added as previousUuid
  • Keywords:
    • A predefined function in “Abstract merge converter”
      • If source has keywordsGroups
        • For each of the keywordGroups in source
          • If target does not already have the keyword
            • The keyword is added to target
  • Links:
    • A predefined function in “Abstract merge converter”
      • For each link in source
        • If target does not have the link
          • The link is added to the target link
        • If there is a link with identical sourceId
          • The Link is updated
  • Documents:
    • A predefined function in “Abstract merge converter”
      • For each document in source, if either of the following is fulfilled
        • Target has a document with same source source Id
        • Target has a document with same file name
        • Target has a document with same title
          • The document is cloned and added to target document list
  • Clipping relations:
    • Relations are added to target, if not already present
  • Publication relations:
    • Relations are added to target, if not already present
  • Impact relations:
    • Relations are added to target, if not already present
  • Equipment relations:
    • Relations are added to target, if not already present
  • Thesis relations:
    • Relations are added to target, if not already present
  • Title:
    • Copy only locales that do not already exist
  • Descriptions:
    • For each description in source
      • If target does not contain a description of the given type
        • The description is cloned and added to target list of descriptions
      • If there is a description with same type
        • Locales that do not already exist are copied
  • Organisations:
    • For each organisation in source
      • If not:
        • Target already has the organisation
        • OR
        • Target organisations and the source organisation have identical sources
          • A new organisation association is constructed
            • The source organisation is added to new association
            • The sources for the source organisation and the new association are merged
            • The new association is added to target list of organisations associations
  • External organisations:
    • For each of the source external organisations
      • If not:
        • Target already has the external organisation
        • OR
        • Target organisations and the source organisation have identical sources
          • The external organisation is added to target list of external organisations
            • The existing external organisation is just added to the target list of external organisations
  • Persons:
    • For each person association in source
      • If target has an identical person with the same role (checked with name and name variants)
        • Secondary sources from the person association in source are assigned to the target person association
        • If the source person is internal, and the target person is external
          • All internal organisation associations are added to the internal person
          • The external person from target is replaced by the internal person from source
        • If target person is internal, and the source person is external
          • The internal organisation associations are added to the internal person
      • If target does not have an identical person, or has the same person with another role
      • OR
      • If target does not have a person with the same source id
        • A new Person Association is created
        • From the person Association in source the following is added to new association:
          • Person
          • Source Id
          • Source
          • External Organisation
          • External or not
          • Name
          • Secondary sources
          • Source Data
        • The person association is added to target
  • Indicators:
    • For each of the source indicators:
      • If not:
        • Target already has that indicator
        • OR
        • Target indicators and the source indicator have identical sources
          • The source indicator is added to target list of indicators
            • No clone is generated, the existing source indicator is just added to the target list of indicators.
  • Owner:
    • If target is null, it is replaced by source
  • Enddate:
    • If target is null, it is replaced by source
  • Startdate:
    • If target is null, it is replaced by source
  • Degree of recognition:
    • If target is null, it is replaced by source
  • Visibility:
    • Lowest visibility wins
  • Activity Activity Relations:
    • Activity relation is moved from source to target
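A few of the simpler merge rules above (null-filling of owner, dates and degree of recognition, lowest visibility wins, keywords added if missing, and previousUuids) can be illustrated with the sketch below. It is not the “Abstract merge converter” itself: the `Activity` class, its field names, `merge_into`, and the visibility level names and their ordering are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed ordering, from lowest (most restricted) to highest visibility.
VISIBILITY_ORDER = ["confidential", "backend", "campus", "public"]


@dataclass
class Activity:
    """Hypothetical, heavily simplified activity record."""
    uuid: str
    visibility: str
    owner: Optional[str] = None
    start_date: Optional[str] = None
    end_date: Optional[str] = None
    degree_of_recognition: Optional[str] = None
    keywords: List[str] = field(default_factory=list)
    previous_uuids: List[str] = field(default_factory=list)


def merge_into(target: Activity, source: Activity) -> Activity:
    """Apply a subset of the merge rules, mutating and returning target."""
    # Owner, dates, degree of recognition: source is used only if target is null.
    for name in ("owner", "start_date", "end_date", "degree_of_recognition"):
        if getattr(target, name) is None:
            setattr(target, name, getattr(source, name))

    # Visibility: the lowest of the two wins.
    target.visibility = min(target.visibility, source.visibility,
                            key=VISIBILITY_ORDER.index)

    # Keywords: added only if target does not already have them.
    for keyword in source.keywords:
        if keyword not in target.keywords:
            target.keywords.append(keyword)

    # PreviousUuids: the source uuid plus any previous uuids it carried.
    target.previous_uuids.append(source.uuid)
    target.previous_uuids.extend(source.previous_uuids)
    return target
```

For example, merging a "campus" source into a "public" target leaves the merged activity at "campus", and an owner already set on the target is kept even when the source names a different one.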