Splitting, Clustering and Joining


Looking first at the creator field, let’s use OpenRefine’s clustering feature to identify and merge similar but non-identical values across a large dataset.

Split the Creator Field into Individual Names

a. Select the dropdown of the Creator Column

b. Choose “Edit cells” > “Split multi-valued cells…”:

  • Enter ; as the separator to split each name into its own cell within the same column.
  • This will separate names like “Painter, Charles G.; Simpson, William Ray; Parks, F.P.” into three individual rows for each name.
Gif image of the OpenRefine interface splitting named entities into distinct cells
Splitting Multi-Value Cells

Formatting

Before we cluster, I’m noticing many of the discrepancies between the author names are due to inconsistent whitespace. To correct these items or adjust for upper / lowercasing inconsistencies you may encounter:

Lowercasing and Removing Whitespace

a. In the column menu (e.g., location name), choose “Edit cells” > “Transform…”.

b. Enter transformations to standardize formats:

  • To make all entries title case: Enter value.toTitlecase().
  • To make entries lowercase: Enter value.toLowercase().
  • To trim leading or following whitespace: Enter value.trim().
Gif of the OpenRefine Interface Trimming Whitespace From Column
Trimming Whitespace From Column

Clustering

a. Click on the drop-down menu in the column you want to standardize (e.g., author).

b. Select “Edit cells” > “Cluster and edit…”.

  • A dialog box will open, showing clusters of similar values. For example, it might group “Mark Twain” and “Twain, Mark” as potential matches.

c. Use different clustering methods discussed in the previous section to identify similar entries.

d. Review each cluster: If you agree with OpenRefine’s grouping, type in the standardized name you want to use (e.g., “Mark Twain”) in the “New Cell Value” field and click “Merge and Re-Cluster” or Merge and Close if you are finished with all of the different algorithms you are interested in applying.

Gif of the OpenRefine Interface Clustering Entity Variations with Various Algorithms
Clustering Entity Variations with Various Algorithms

Group Names Back Together with Semicolons

a. Select the Dropdown of the Column:

b. Choose “Edit cells” > “Join multi-valued cells…”:

c. Set the Separator:

  • In the modal that appears, enter ; as the separator (the same one used when you split the names).

d. Click “OK”:

Gif of the OpenRefine Interface Joining Multi-Valued Cells To Complete This Process
Joining Multi-Valued Cells To Complete This Process