Questions tagged with AWS Glue
Content language: English
Sort by most recent
Hi,
I don't see Glue DataBrew in Terraform's AWS provider.
I do see that it's supported in CloudFormation (https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-databrew-job.html).
Can anyone help me locate it please?
Hello All,
I am trying to implement the solution described at the link below:
https://medium.com/analytics-vidhya/multithreading-parallel-job-in-aws-glue-a291219123fb
In that solution, they show AWS logs containing the scheduler settings. I don't understand where I can find these complete logs.
I am running the Glue job from the console, and there I see three types of logs:
All logs
Output logs
Error logs
When I open "All logs", I don't see anything.
"Output logs" shows the output I print in my script, plus some entries related to the PySpark application.
"Error logs" I am not sure about.
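My understanding of the pattern in that article (which may not match it exactly) is to submit several Spark actions from Python threads inside one Glue job, with a scheduler-related setting applied; the sketch below is only how I picture it, and the table names and processing step are placeholders of mine, not taken from the article.
```python
# Rough sketch of in-job parallelism as I understand it; table names and the
# per-table work are placeholders.
from concurrent.futures import ThreadPoolExecutor
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Scheduler-related setting, assumed to be what the article's logs refer to.
conf = SparkConf().set("spark.scheduler.mode", "FAIR")
sc = SparkContext.getOrCreate(conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session

def process_table(table_name):
    # Each thread triggers its own Spark action concurrently.
    return table_name, spark.table(table_name).count()

with ThreadPoolExecutor(max_workers=3) as executor:
    for name, count in executor.map(process_table, ["db.table_a", "db.table_b", "db.table_c"]):
        print(name, count)
```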
I'm running a Glue 4.0 job with some local algorithmic process. I tested this on my local instance and it works fine.
`from sklearn.model_selection import StratifiedGroupKFold, RandomizedSearchCV`
But when I run it on Glue, it throws an exception:
```
ImportError: cannot import name 'StratifiedGroupKFold' from 'sklearn.model_selection' (/home/spark/.local/lib/python3.10/site-packages/sklearn/model_selection/__init__.py)
```
Glue 4.0 is supposed to ship `scikit-learn==1.1.3`, which is compatible with the version on my local instance according to https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html. I'm not sure why this happens.
**Update I**
A little bit weird: I printed the sklearn version from inside the Glue job, and it shows `scikit-learn==0.24.2`, which doesn't match the official doc. Is there a version mismatch?
**Update II**
I tried appending the job parameters below to force-upgrade the scikit-learn version, but that's not a great solution given the library version mismatch.
```
--additional-python-modules: scikit-learn
--python-modules-installer-option: --upgrade
```
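For reference, this is a minimal sketch of the runtime check I am doing, assuming the job parameter is changed to pin the version (e.g. `--additional-python-modules scikit-learn==1.1.3`); it is only a diagnostic, not a fix.
```python
# Minimal runtime check inside the Glue script; assumes the job parameter pins
# the version, e.g. --additional-python-modules scikit-learn==1.1.3
import sklearn
print(f"scikit-learn version seen by the job: {sklearn.__version__}")

# This is the import that fails when an older sklearn is picked up.
from sklearn.model_selection import StratifiedGroupKFold, RandomizedSearchCV
```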
Hello All,
I am working on a Glue PySpark script. In this script I read data from a table and store it in a PySpark DataFrame. Now I want to add a new column whose value is calculated by passing existing columns to an AWS Lambda function and returning the result.
So, is it possible to call the Lambda service from a Glue script?
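What I have in mind is roughly the sketch below; the Lambda function name, payload shape, and column names are placeholders, and I realise that invoking Lambda once per row may be slow.
```python
# Rough sketch: call a Lambda function from a Spark UDF to compute a new column.
# Function name, payload shape, and column names are placeholders.
import json
import boto3
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def call_lambda(col_a, col_b):
    client = boto3.client("lambda")
    response = client.invoke(
        FunctionName="my-calc-function",  # placeholder function name
        Payload=json.dumps({"a": col_a, "b": col_b}),
    )
    return json.loads(response["Payload"].read())["result"]  # assumed response shape

calc_udf = udf(call_lambda, StringType())
df = df.withColumn("new_col", calc_udf(df["col_a"], df["col_b"]))
```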
Hello,
I'm stuck with this error and I can't find anything helpful.
I'm trying to migrate data from S3 to Redshift.
Note: I crawled both, and both tables are in my Glue databases.
But when I run the job, I get this error:
An error occurred while calling o131.pyWriteDynamicFrame. Exception thrown in awaitResult:
I have a Glue job which reads from a Glue Catalog table in Hudi format. Every read returns the whole dataset, while I expect only the first run to contain data and all subsequent runs to return an empty dataset (given there were no changes to the source Hudi dataset).
My Hudi config is the following:
```
hudi_config = {
    'className': 'org.apache.hudi',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.write.precombine.field': config['sort_key'],
    'hoodie.datasource.write.partitionpath.field': config['partition_field'],
    'hoodie.datasource.hive_sync.partition_fields': config['partition_field'],
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.HiveStylePartitionValueExtractor',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.hive_sync.assume_date_partitioning': 'false',
    'hoodie.datasource.write.recordkey.field': config['primary_key'],
    'hoodie.table.name': config['hudi_table'],
    'hoodie.datasource.hive_sync.database': config['target_database'],
    'hoodie.datasource.hive_sync.table': config['hudi_table'],
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.consistency.check.enabled': 'true',
    'hoodie.cleaner.commits.retained': 10,
    'path': f"s3://{config['target_bucket']}{config['s3_path']}",
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',  # noqa: E501
    'hoodie.bulkinsert.shuffle.parallelism': 200,
    'hoodie.upsert.shuffle.parallelism': 200,
    'hoodie.insert.shuffle.parallelism': 200,
    'hoodie.datasource.hive_sync.support_timestamp': 'true',
    # 'hoodie.datasource.write.operation': "insert"
    'hoodie.datasource.write.operation': "upsert"
}
```
The reading part looks like this:
```
Read_node1 = glueContext.create_data_frame.from_catalog(
    database="gluedatabase",
    table_name="table",
    transformation_ctx="Read_node1"
)
AWSGlueDataCatalog_node = DynamicFrame.fromDF(Read_node1, glueContext, "AWSGlueDataCatalog_node")
```
The result is written to an S3 bucket and always produces the same file.
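For comparison, what I expected to need is something like a Hudi incremental query, roughly as in the sketch below; how the begin instant time is tracked between runs is an assumption on my part, not something from my current job.
```python
# Sketch of a Hudi incremental read; only commits after the begin instant time
# should be returned. The instant-time tracking is an assumption.
last_processed_commit = "20230101000000"  # placeholder: would need to be persisted between runs

incremental_options = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": last_processed_commit,
}
incremental_df = (
    spark.read.format("hudi")
    .options(**incremental_options)
    .load(f"s3://{config['target_bucket']}{config['s3_path']}")
)
```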
Thank you
On the AWS Glue console page, when trying to run a job with parameters and adding key-value pairs in the job parameters section, the page is highly unstable, so I'm unable to add any key-value pair. Has anyone faced the same issue?
hello,
AWS recently announced that Glue crawlers can now create native Delta Lake tables (last December, https://aws.amazon.com/blogs/big-data/introducing-native-delta-lake-table-support-with-aws-glue-crawlers/). We tested it and it works fine. However, we would prefer not to use crawlers.
Is this currently the only way to create a native Delta Lake table? Is there a plan to allow this through the Glue console's table-creation screen?
As a side note, it looks like Terraform is still missing a "CreateNativeDeltaTable" option in its latest provider (there is an open issue for that).
Thanks.
Cheers,
Fabrice
I am planning to utilize catalogPartitionPredicate in one of my projects. I am unable to handle one of the scenarios. Below are the details:
1. Partition columns: Year,Month & Day
2. catalogPartitionPredicate: year>='2021' and month>='12'
If the year changes to 2022 (2022-01-01) and I want to read data from 2021-12-01 onwards, the expression can't handle it, because month >= '12' excludes January and so no 2022 data is read. I tried concatenating the partition keys, but that didn't work.
Is there any way to implement to_date functionality or any other workaround to handle this scenario?
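One workaround I am considering is rewriting the predicate to handle the year boundary explicitly, roughly as sketched below; the database and table names are placeholders.
```python
# Sketch: express "on or after 2021-12-01" without to_date by handling the
# year boundary explicitly. Database/table names are placeholders.
predicate = "(year > '2021') or (year = '2021' and month >= '12')"

dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",   # placeholder
    table_name="my_table",    # placeholder
    additional_options={"catalogPartitionPredicate": predicate},
)
```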
I was able to include the DatabricksJDBC42.jar in my Glue Docker container used for local machine development ([link](https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/)).
I am able to reach the host from the Jupyter notebook, but I am getting an SSL-type error:
```
Py4JJavaError: An error occurred while calling o80.load.
: java.sql.SQLException: [Databricks][DatabricksJDBCDriver](500593) Communication link failure. Failed to connect to server. Reason: javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target.
```
My connection string looks like this:
`.option("url","jdbc:databricks://host.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/111111111111111/1111-111111-abcdefghi;AuthMech=3;UseNativeQuery=0;StripCatalogName=0;")\
.option("dbtable","select 1")\
.option("driver", "com.databricks.client.jdbc.Driver")\
.load()`
I used the same JDBC string in the code uploaded to our live account, and the AWS Glue job runs and executes the queries in dbtable just fine. It's only in the local Docker Glue development container that we get this SSL error.
I tried adding separate **option** entries for sslConnection and sslCertLocation, and placed the files in /root/.aws as well as in the Jupyter notebook folder. The cert shows up in directory listings and is referenced correctly, but the JDBC connection still fails with the SSL error.
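One direction I have not fully tried is pointing the JVM inside the container at a truststore that contains the Databricks certificate, roughly like this sketch; the truststore path and password are assumptions about my own setup, not a known fix.
```python
# Sketch only: make the Spark driver JVM trust the Databricks certificate by
# pointing it at a custom truststore. Path and password are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config(
        "spark.driver.extraJavaOptions",
        "-Djavax.net.ssl.trustStore=/home/glue_user/certs/databricks.jks "
        "-Djavax.net.ssl.trustStorePassword=changeit",
    )
    .getOrCreate()
)
```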
Anyone see this before or have a suggestion for next steps?
Thanks.
I'm trying to load two target tables from a single source. The source has data for EMP NAME and ADDRESS. Target table A has EMP ID (auto-generated PK) and EMP NAME; table B has EMP ID (foreign key), ADDRESS ID (auto-generated PK), and ADDRESS.
How do I load these two tables using AWS Glue?
I can't find proper notes anywhere for this; can you help clarify?
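The closest I can picture is something like the sketch below, splitting the source into the two target shapes; the connection, database, and column names are placeholders, and the foreign-key lookup is the part I don't know how to do in Glue.
```python
# Rough sketch (connection, database and column names are placeholders): split
# the already-read source DynamicFrame into the two target shapes and write
# table A first so the database generates EMP_ID. Populating table B's EMP_ID
# foreign key would still require looking the generated keys back up, which
# Glue does not automate.
from awsglue.transforms import SelectFields

emp_frame = SelectFields.apply(frame=source_dyf, paths=["EMP_NAME"])
addr_frame = SelectFields.apply(frame=source_dyf, paths=["EMP_NAME", "ADDRESS"])

glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=emp_frame,
    catalog_connection="my-db-connection",  # placeholder Glue connection
    connection_options={"dbtable": "table_a", "database": "mydb"},
)
```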
Attempting to run a very trivial Glue script locally via Docker, I can't seem to connect to a MySQL database that is also running in Docker.
My Docker setup is:
```yaml
version: '3.7'
services:
  glue:
    container_name: "dev-glue"
    image: amazon/aws-glue-libs:glue_libs_3.0.0_image_01-arm64
    ports:
      - "4040:4040"
      - "18080:18080"
    volumes:
      - ~/.aws:/home/glue_user/.aws
      - /workspace:/home/glue_user/workspace/
    environment:
      - "AWS_PROFILE=$AWS_PROFILE"
      - "AWS_REGION=us-west-2"
      - "AWS_DEFAULT_REGION=us-west-2"
      - "DISABLE_SSL=true"
    stdin_open: true

  mysql:
    image: mysql:8.0
    container_name: 'dev-mysql'
    command: --default-authentication-plugin=mysql_native_password
    restart: always
    environment:
      MYSQL_ROOT_PASSWORD: password
    volumes:
      - mysql-db:/var/lib/mysql
    ports:
      - '3306:3306'

volumes:
  mysql-db:
```
And, according to the documentation found here, the following should work without a problem:
```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())

def df_mysql(glue_context: GlueContext, schema: str, table: str):
    connection = {
        "url": f"jdbc:mysql://dev-mysql/{schema}",
        "dbtable": table,
        "user": "root",
        "password": "password",
        "customJdbcDriverS3Path": "s3://my-bucket/mysql-connector-java-8.0.17.jar",
        "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"
    }
    data_frame: DynamicFrame = glue_context.create_dynamic_frame.from_options(
        connection_type="mysql", connection_options=connection
    )
    data_frame.printSchema()

df_mysql(glueContext, "my_schema", "my_table")
```
However this fails with
```
Traceback (most recent call last):
File "/home/glue_user/workspace/local/local_mysql.py", line 25, in <module>
df_mysql(glueContext, "my_schema", "my_table")
File "/home/glue_user/workspace/local/local_mysql.py", line 19, in df_mysql
data_frame: DynamicFrame = glue_context.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection)
File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/dynamicframe.py", line 608, in from_options
File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/context.py", line 228, in create_dynamic_frame_from_options
File "/home/glue_user/aws-glue-libs/PyGlue.zip/awsglue/data_source.py", line 36, in getFrame
File "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
File "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o45.getDynamicFrame.
: java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:46)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1(JDBCOptions.scala:102)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1$adapted(JDBCOptions.scala:102)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:102)
at org.apache.spark.sql.jdbc.glue.GlueJDBCOptions.<init>(GlueJDBCOptions.scala:14)
at org.apache.spark.sql.jdbc.glue.GlueJDBCOptions.<init>(GlueJDBCOptions.scala:17)
at org.apache.spark.sql.jdbc.glue.GlueJDBCSource$.createRelation(GlueJDBCSource.scala:29)
at com.amazonaws.services.glue.util.JDBCWrapper.tableDF(JDBCUtils.scala:878)
at com.amazonaws.services.glue.util.NoCondition$.tableDF(JDBCUtils.scala:86)
at com.amazonaws.services.glue.util.NoJDBCPartitioner$.tableDF(JDBCUtils.scala:172)
at com.amazonaws.services.glue.JDBCDataSource.getDynamicFrame(DataSource.scala:967)
at com.amazonaws.services.glue.DataSource.getDynamicFrame(DataSource.scala:99)
at com.amazonaws.services.glue.DataSource.getDynamicFrame$(DataSource.scala:99)
at com.amazonaws.services.glue.SparkSQLDataSource.getDynamicFrame(DataSource.scala:714)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
```
What's interesting here is that it's failing with `java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver`, but if I change the config to use something else, it fails with the same error message.
Is there some other environment variable or configuration I'm supposed to set in order for this to work?
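One workaround I am considering (not sure it is the intended way) is to copy the MySQL connector jar into the container and use the plain Spark JDBC reader instead of relying on customJdbcDriverS3Path; a sketch, where the local jar path is an assumption about my own setup:
```python
# Sketch of a possible workaround: register a locally copied MySQL connector jar
# with Spark and use the plain JDBC reader. The jar path is an assumption.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars", "/home/glue_user/workspace/jars/mysql-connector-java-8.0.17.jar")
    .getOrCreate()
)

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://dev-mysql:3306/my_schema")
    .option("dbtable", "my_table")
    .option("user", "root")
    .option("password", "password")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load()
)
df.printSchema()
```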