Fastest way to load BigQuery data into S3 using AWS Glue | Load BigQuery data into a Spark native DataFrame using AWS Glue

0

Hi,

Currently I am loading data from BigQuery into a DynamicFrame using GlueContext. There are around 60 million records that need to be loaded in each batch, which is taking a lot of time.

I need to load it into a Spark native DataFrame using the following syntax:

df = spark.read \
  .format("bigquery") \
  .load("project.dataset.table")

I have downloaded the BigQuery connector jars for Spark 3.1, placed them in an S3 bucket, and referenced them as a dependency in Glue (I am using a Glue notebook as the testing environment, but I want to run this as a Glue script job). However, it is not working. I want to find a fast way to load this data into S3.
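For reference, this is roughly what I am trying in the Glue script - a rough sketch with placeholder names (project, dataset, table, buckets), assuming the connector jar is supplied through the --extra-jars job parameter:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

# The BigQuery connector jar is assumed to be passed via --extra-jars,
# e.g. --extra-jars s3://my-bucket/jars/spark-bigquery-with-dependencies_2.12.jar
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Read directly into a Spark DataFrame; "parentProject" and the table path are
# placeholders, and GCP credentials still need to be supplied to the connector
df = (
    spark.read.format("bigquery")
    .option("parentProject", "my-gcp-project")
    .load("project.dataset.table")
)

# Land the result in S3 as Parquet
df.write.mode("overwrite").parquet("s3://my-output-bucket/bigquery-export/")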

I am using 2 DPUs and I do not want to increase them due to cost concerns.

Safwan
asked 2 months ago · 114 views
2 Answers
0

2 DPUs is the bare minimum, and one of those is the driver, so you cannot expect to do anything meaningful quickly.
You can probably save money running, say, 4 DPUs, since that gives you 3 times more actual compute for twice the cost, so the job should finish much faster and reduce the overall usage. As for "But it is not working" - you'll have to be more specific to get help on that.

AWS
EXPERT
answered 2 months ago
  • My job is running, but it is taking too much time. I am aware that more DPUs would give more compute, but I think loading the data directly into a Spark DataFrame would take less time than loading it into a Glue DynamicFrame. I need a way to load the data directly from BigQuery using something like spark.read.format("bigquery").load(table). My problem is that I am unable to load it directly into a Spark DataFrame; I hope this clarifies the issue. Moreover, I am getting an error when using df = spark.read.format("bigquery").load("project.dataset.table"): ClassNotFoundException: Failed to find data source: bigquery.
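For reference, this is the kind of check that could show whether the connector jar is actually on the classpath (the fully qualified source name below is my assumption about how the open-source spark-bigquery connector registers itself):

# If the jar is on the classpath but the short name "bigquery" is not registered,
# the fully qualified data source name may work as a fallback (assumption)
df = (
    spark.read.format("com.google.cloud.spark.bigquery")
    .load("project.dataset.table")
)

# Quick-and-dirty look at the driver classpath to confirm the jar was picked up
print(spark.sparkContext._jvm.java.lang.System.getProperty("java.class.path"))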

0

[Edit - found this after I typed the below] This goes into performance tuning.

Midway through this blog post they go into performance tuning of the BigQuery read; they specifically talk about hardware sizing and scaling out, along with tuning BigQuery properties.

[Original post] Have you seen this from here:

bigquery_read = glueContext.create_dynamic_frame.from_options(
    connection_type="bigquery",
    connection_options={
        "connectionName": "connectionName",
        "parentProject": "parentProject",
        "sourceType": "table",
        "table": "tableName",
    },
)

I know you said you don't want to use a DynamicFrame, but since you are loading 60M records, I doubt that converting from a DynamicFrame to a DataFrame would add significant overhead (I haven't tested this, but just thinking through what's going to take the majority of the time).
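To make that concrete, here is a minimal sketch of the conversion, reusing the bigquery_read DynamicFrame from the snippet above (the output path and partition count are placeholders):

# Convert the Glue DynamicFrame to a plain Spark DataFrame
df = bigquery_read.toDF()

# Write to S3 as Parquet; adjust the partition count to match your worker count
df.repartition(8).write.mode("overwrite").parquet("s3://my-output-bucket/bigquery-export/")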

If this just won't work for you, then to do what you want (spark.read.format("bigquery")), I would mimic what you have to do with Iceberg. More specifically, you can follow this:

You will need to provide all the BigQuery jars via the --extra-jars job parameter, as you said you are doing. The thing is, you probably need an AWS-specific version of those jars, because you are probably going to need the AWS SDKs. When you use Iceberg with Glue and use a specific version that is not the one provided by Glue 4.0, you also have to provide the AWS SDKs and an HTTP client (Apache or Url-Connection) to make it all work, due to the service calls to Glue and S3, for example. I don't know if GCP provides those. My guess is that AWS has already worked that out for their DynamicFrame support (as shown above), but perhaps they haven't exposed those libraries publicly(?). Hence my recommendation to go with the DynamicFrame approach. A sketch of how the job parameters might look is below.
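For illustration only - a rough sketch of setting --extra-jars via boto3 (the job name, role, script location, and jar path/version are all placeholders I made up; the same parameters can be set in the console under Job parameters):

import boto3

glue = boto3.client("glue")

# Rough sketch - all names and the jar version are placeholders
glue.update_job(
    JobName="bigquery-to-s3",
    JobUpdate={
        "Role": "my-glue-job-role",
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://my-bucket/scripts/bq_to_s3.py",
        },
        "DefaultArguments": {
            # Point Glue at the connector jar(s) uploaded to S3
            "--extra-jars": "s3://my-bucket/jars/spark-bigquery-with-dependencies_2.12-0.32.2.jar",
        },
        "GlueVersion": "3.0",  # Glue 3.0 ships Spark 3.1, matching the jars mentioned above
        "WorkerType": "G.1X",
        "NumberOfWorkers": 4,  # per the earlier point, 4 DPUs usually beats the 2-DPU minimum
    },
)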

Lastly, in my experience, as mentioned earlier, it often turns out to be cheaper to use 4 or 8 DPUs instead of the minimum, because your job runs much faster, which is what you want anyway.

One last thought - with Iceberg there is also the ability to use S3 acceleration, but the Iceberg team built that in, I believe. You might also look and see if the GCP team has provided for S3 acceleration. If your BigQuery table is not "close" to your AWS region, that could also materially improve performance. It will cost some amount, but in terms of total job cost, it will probably be cheaper than being hardware-constrained and paying more in the long run.

answered 2 months ago
