Data transformation not taken into account in AWS Glue
I have an S3 bucket containing folders of JSON files. I want to build a database so I can query these documents on a few keys through a Lambda-based API, but first I need to normalize the data. For example, I need to transform all the files in the /jomalone/ folder, which look like this:
{
"data": {
"products": {
"items": [
{
"default_category": {
"id": "25956",
"value": "Bath & Body"
},
"description": "London's Covent Garden early morning market. Succulent nectarine, peach and cassis and delicate spring flowers melt into the note of acacia honey. Sweet and delightfully playful. Our luxuriously rich Body Crème with its conditioning oils of jojoba seed, cocoa seed and sweet almond, help to hydrate, nourish and protect the skin, while delicious signature fragrances leave your body scented all over.",
"display_name": "Nectarine Blossom & Honey Body Crème",
"is_hazmat": false,
"meta": {
"description": "The Jo Malone™ Nectarine Blossom & Honey Body Crème leaves skin beautifully scented with fruity notes of nectarine and peach sweetened with acacia honey."
},
...
{
"currency": "EUR",
"is_discounted": false,
"include_tax": {
"price": 68,
"original_price": 68,
"price_per_unit": 38.86,
"price_formatted": "€68.00",
"original_price_formatted": "€68.00",
"price_per_unit_formatted": "€38.86 / 100ML"
}
}
],
"sizes": [
{
"value": "175ML",
"key": 1
}
],
"shades": [
{
"name": "",
"description": "",
"hex_val": ""
}
],
"sku_id": "L4P801",
"sku_badge": null,
"unit_size_formatted": "100ML",
"upc": "690251040254",
"is_engravable": null,
"perlgem": {
"SKU_BASE_ID": 63584
},
"media": {
"large": [
{
"src": "/media/export/cms/products/1000x1000/jo_sku_L4P801_1000x1000_0.png",
"alt": "Nectarine Blossom & Honey Body Crème",
"height": 1000,
"width": 1000
},
{
"src": "/media/export/cms/products/1000x1000/jo_sku_L4P801_1000x1000_1.png",
"alt": "Nectarine Blossom & Honey Body Crème",
"height": 1000,
"width": 1000
}
],
"medium": [
{
"src": "/media/export/cms/products/670x670/jo_sku_L4P801_670x670_0.png",
"alt": "Nectarine Blossom & Honey Body Crème",
"height": 670,
"width": 670
}
],
"small": [
{
"src": "/media/export/cms/products/100x100/jo_sku_L4P801_100x100_0.png",
"alt": "Nectarine Blossom & Honey Body Crème",
"height": 100,
"width": 100
}
]
},
"collection": null,
"recipient": [
{
"key": "mom-recipient",
"value": "mom_recipient"
},
{
"key": "bride-recipient",
"value": "bride_recipient"
},
{
"key": "host-recipient",
"value": "host_recipient"
},
{
"key": "me-recipient",
"value": "me_recipient"
},
{
"key": "her-recipient",
"value": "her_recipient"
}
],
"occasion": [
{
"key": "thankyou-occasion",
"value": "thankyou_occasion"
},
{
"key": "birthday-occasion",
"value": "birthday_occasion"
},
{
"key": "treat-occasion",
"value": "treat_occasion"
}
],
"location": [
{
"key": "bathroom-location",
"value": "bathroom_location"
}
]
}
]
}
}
]
}
}
}
into JSON documents with the following schema:
brandName String
productName String
productLink String
productType ?
maleFemale Male/Female
price float
unitPrice String
size float
ingredients String
notes String
numReviews Int
userIDs float
locations float
dates Date
ages int
sexes M/F
ratings Int
reviews Array of String
sources String
characteristics String
specificRatings String
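To make the target concrete, here is roughly how the sample product above would map onto that schema (a plain-Python sketch; the "prices" nesting is my guess at where the "..." sits in the sample, and the review-related fields come from elsewhere, so I leave them empty):

```python
# Sketch: mapping one nested product item onto the flat target schema.
# Values are taken from the sample product above; fields I cannot derive
# from the product feed (reviews, ages, ...) are left as placeholders.
item = {
    "display_name": "Nectarine Blossom & Honey Body Crème",
    "default_category": {"id": "25956", "value": "Bath & Body"},
    "sizes": [{"value": "175ML", "key": 1}],
    # Assumed key: the sample truncates ("...") before the price object.
    "prices": [{"currency": "EUR",
                "include_tax": {"price": 68,
                                "price_per_unit_formatted": "€38.86 / 100ML"}}],
}

price_info = item["prices"][0]["include_tax"]
record = {
    "brandName": "Jo Malone",          # inferred from the /jomalone/ folder
    "productName": item["display_name"],
    "productType": item["default_category"]["value"],
    "price": float(price_info["price"]),
    "unitPrice": price_info["price_per_unit_formatted"],
    "size": float(item["sizes"][0]["value"].rstrip("ML")),
    "reviews": [],                     # review data comes from another source
}
```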
So I tried AWS Glue, but I don't know how to get rid of the nesting, such as the wrapper keys at the beginning:
"data": {
"products": {
"items": [
...
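In plain Python, the unnesting I'm after is just a couple of dictionary lookups; a minimal sketch (outside Glue, assuming each file is a single JSON document of this shape):

```python
import json

# Minimal sketch of the unnesting: drop the data/products wrappers and
# keep only the items array, so each product becomes one record.
doc = json.loads('{"data": {"products": {"items": [{"sku_id": "L4P801"}]}}}')

items = doc["data"]["products"]["items"]
for product in items:
    print(product["sku_id"])
```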
I used the visual editor to test some modifications on the field names, but judging by the Data preview tab, it doesn't seem to have any of the effects I was expecting. I had deleted the first and last underlined fields and renamed the others, yet none of this seems to have been taken into account in the preview. Indeed, there doesn't seem to be anything like that mapping in the generated script:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node S3 bucket
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
format_options={"multiline": False},
connection_type="s3",
format="json",
connection_options={"paths": ["s3://datahubpredicity/JoMalone/"], "recurse": True},
transformation_ctx="S3bucket_node1",
)
# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
frame=S3bucket_node1,
mappings=[("data.products.items", "array", "data.products.items", "array")],
transformation_ctx="ApplyMapping_node2",
)
# Script generated for node S3 bucket
S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
frame=ApplyMapping_node2,
connection_type="s3",
format="json",
connection_options={"path": "s3://datahubpredicity/merged/", "partitionKeys": []},
transformation_ctx="S3bucket_node3",
)
job.commit()