1 Answer
- Newest
- Most votes
- Most comments
0
this is a very interesting scenerio i did reproduce this
source mongodb collection
rs0:PRIMARY> db.testdata.find().pretty()
{
"_id" : ObjectId("628c8ab5f530bdb073b54017"),
"column 1" : "john",
"column 2" : "mark"
}
{
"_id" : ObjectId("628c8accf530bdb073b54018"),
"column 1" : "mike",
"column 2" : "mark"
}
dms transformations on column names allow you to rename but it wont help here
also it is not possible to replace just the space (" ") but it allows to change prefix or suffix
in my case columns are "column 1" and "column 2"
which qualifies for prefix "column "
so made the table mapping as
{
"rules": [
{
"rule-type": "transformation",
"rule-id": "378097618",
"rule-name": "378097618",
"rule-target": "column",
"object-locator": {
"schema-name": "mydata",
"table-name": "%",
"column-name": "%"
},
"rule-action": "replace-prefix",
"value": "column_",
"old-value": "column "
},
{
"rule-type": "selection",
"rule-id": "378089419",
"rule-name": "378089419",
"object-locator": {
"schema-name": "mydata",
"table-name": "%"
},
"rule-action": "include",
"filters": [
]
}
]
}
this mapping would then change prefix "column " to "column_"
final s3 parq file schema and data
[ec2-user@ip-10-0-0-4 ~]$ parq LOAD00000001.parquet -s
# Schema
<pyarrow._parquet.ParquetSchema object at 0x7fda6fa8b5d0>
oid__id: BYTE_ARRAY String
column_1: BYTE_ARRAY String
column_2: BYTE_ARRAY String
[ec2-user@ip-10-0-0-4 ~]$ parq LOAD00000001.parquet --head
oid__id column_1 column_2
0 628c8ab5f530bdb073b54017 john mark
1 628c8accf530bdb073b54018 mike mark
I hope this dynamic change of column prefix helps.
answered 2 years ago
Relevant content
- AWS OFFICIALUpdated 2 years ago
- AWS OFFICIALUpdated a year ago
- AWS OFFICIALUpdated a year ago
- AWS OFFICIALUpdated a year ago
I tried this but it is not working. Below is subset of document from Mongo that I am trying to migrate to S3. { "_id": { "$oid": "123" }, "synonyms": { "Ceramic": [ "Ceramic" ], "Stainless Steel": [ "Stainless Steel" ] } } Columns printed from parquet: ['oid__id', 'array_synonyms.Ceramic', 'array_synonyms.Stainless Steel'] The spaces are still there. Also, have commas too in some columns, please help me with this.
in your data set, data seems very unique and not aligned to field name as "Column " did you tried making a copy of collection and replacing the spaces on the source itself.