join-datatables
Joins two datatables based on a common column between them.
This is a commonly-used adaptor for combining two separate datasources. As long as there is a common column, the datatables can be associated and joined.
Arguments
Inputs:
main data
: The main (left) datatable. When inner join
is False
, all rows from this datatable will be included.
main column
: The column in main data
containing values shared by other column
in other data
.
other data
: The other (right) datatable, to be joined to main datatable.
other column
: The column in other data
containing values shared by main column
in main data
. If more than one row match main column value, only the first matching row will be joined. If unspecified, the name of main column
will be used.
inner join
: Specifies whether to include rows that have matching values in both tables. If unspecified, defaults to "false" (all rows from main data will be included, and rows from other data will be included when 'other column' matches 'main column').
columns
: Specifies which columns of other data
to include. If unspecified, all columns in other data
will be included. NOTE: This is a map. Key=column name in other data
, Value=desired column name (this provides ability to rename columns from other data
)
overwrite
: Specifies whether to overwrite columns which exist in both tables. If unspecified, defaults to "false" (columns will be included twice if they exist in both tables).
distance
: Specifies whether to use fuzzy matching when joining the rows. If unspecified, defaults to 100
(full match).
Output:
data
: A datatable containing joined rows and columns.
skipped
: A datatable containing rows from main data
that have not been joined. If inner join
is "false", then there are no skipped rows from main data
.
TIP
: keep in mind that unmatched rows from other data
will not be returned in either data
or skipped
.
Example
The following images show a simple comparison of join-datatables adaptor configuration.
The first output shows that when
inner join
is false (default), all rows frommain data
are included, even when there is no additional data to join fromother data
. Note that when the datatables are swapped, the output is different (see the third output, which keeps unmatched rows from the newmain data
, which was theother data
input in the first table.The second output excludes rows from
main data
that lack a counterpart inother data
. In this case, theskipped
output will be a single row (sample, country, capital.2, us, <empty>
)
Possible use cases
Combine lab and sequencing data with epidemiological metadata.
Use after split-column
Last updated