Data Wrangler Full Outer Join Not Working As Expected Nor Concatenate

0

I've got two CSV files that are loaded into Data Wrangler that are intended to augment each other. The tables have some columns that are the same (in name) and some that are not, many of the rows are missing entries for many of the columns. The two tables represent separate datasets. Consider the example below: Table 1:

FilenameLabelALabelB
./A/001.dat11
./A/002.dat01

Table 2:

FilenameLabelBLabelC
./B/001.dat0
./B/002.dat01

I am looking to merge / concatenate the two table. The problem is that neither Data Wrangler join nor concatenate seems to work (at least as expected).

Desired result:

FilenameLabelALabelBLabelC
./A/001.dat11
./A/002.dat01
./B/001.dat0
./B/002.dat01

When using a "Full Outer" join and ask to combine "Filename" and "LabelB" columns, it will take all the values from Table 1 OR Table 2 even if Table 1 does not have that entry (for example, some rows will have Filename = <nothing> rather than Filename = ./B/001.dat).

When using concatenate, Data Wrangler errors on the fact that it cannot match EVERY column between the tables.

Now in my example there are many columns and many rows which precludes a manual process of joining without merging columns and then going through a renaming and merging process one-by-one. How do get these tables to simply merge? I feel I must be missing something obvious. I am about to give up on Data Wrangler and do it all in a python script using pandas, but I thought I should give Data Wrangler a try while learning the MLops process.

asked 2 years ago468 views
1 Answer
0

Hi, @b33fcafe

Shouldn't you use Concatenate instead of Join because you don't want to join data horizontally but vertically?

After importing Table1 and Table2, add the missing columns and then Concatenate to get the expected result.

If duplicate data occurs, additional processing is possible in subsequent steps.

profile picture
EXPERT
iwasa
answered 2 years ago
  • Thanks for the reply. It has helped me to understand the philosophy of join. Seeing as concatenate is not tenable with the column matching issue, I'll just use a pandas in a notebook to handle it in code. Any insight into why Data Wrangler handles empty entries, but won't just assume that for unmatched columns in a concatenate?

  • Seeing as concatenate is not tenable with the column matching issue

    You can do it with Pandas, but you can also concatenate by adding the missing columns in the Data Wrangler's data flow.
    Since concatenate is a function to concatenate datasets with the same structure, it seems that the same column is required.

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions