join-datatables

Joins two datatables based on a common column between them.

This is a commonly-used adaptor for combining two separate datasources. As long as there is a common column, the datatables can be associated and joined.

Arguments

Inputs:

main data: The main (left) datatable. When inner join is False, all rows from this datatable will be included.

main column: The column in main data containing values shared by other column in other data.

other data: The other (right) datatable, to be joined to main datatable.

other column: The column in other data containing values shared by main column in main data. If more than one row match main column value, only the first matching row will be joined. If unspecified, the name of main column will be used.

inner join: Specifies whether to include rows that have matching values in both tables. If unspecified, defaults to "false" (all rows from main data will be included, and rows from other data will be included when 'other column' matches 'main column').

columns: Specifies which columns of other data to include. If unspecified, all columns in other data will be included. NOTE: This is a map. Key=column name in other data, Value=desired column name (this provides ability to rename columns from other data)

overwrite: Specifies whether to overwrite columns which exist in both tables. If unspecified, defaults to "false" (columns will be included twice if they exist in both tables).

distance: Specifies whether to use fuzzy matching when joining the rows. If unspecified, defaults to 100 (full match).

Output:

data: A datatable containing joined rows and columns.

skipped: A datatable containing rows from main data that have not been joined. If inner join is "false", then there are no skipped rows from main data.

TIP: keep in mind that unmatched rows from other data will not be returned in either data or skipped.

Example

The following images show a simple comparison of join-datatables adaptor configuration.

  • The first output shows that when inner join is false (default), all rows from main data are included, even when there is no additional data to join from other data. Note that when the datatables are swapped, the output is different (see the third output, which keeps unmatched rows from the new main data, which was the other data input in the first table.

  • The second output excludes rows from main data that lack a counterpart in other data. In this case, the skipped output will be a single row (sample, country, capital. 2, us, <empty>)

Possible use cases

  • Combine lab and sequencing data with epidemiological metadata.

  • Use after split-column

Last updated