Questions tagged with AWS Glue
Hi, I have written a few Glue jobs and never faced this situation, but it has suddenly started appearing for a new job I wrote. I am using the code below to write data to S3 (the S3 path is "s3://...."):
```
unionData_df.repartition(1).write.mode("overwrite").parquet(test_path)
```
In my test environment, the first run of the Glue job created an empty file with the suffix `_$folder$`. The same happened in prod. My other jobs do not have this problem. Why is it creating this file, and how can I avoid it? Any pointers on why it happens for this job but not the others? What should I be checking? Note: I think the file only gets created the first time the prefix/folder is created. Some blog posts suggest changing the S3 path to `s3a`, but I am not sure that is the right thing to do.
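These `_$folder$` objects are, as far as I know, placeholder keys that the Hadoop S3 filesystem writes when it creates a "directory" that does not yet exist, which would match the observation that they only appear the first time the prefix is created. If they are merely cosmetic, one workaround is to delete them after the write; a sketch using boto3 (bucket and prefix are placeholders):

```python
def folder_marker_keys(keys):
    """Return only Hadoop-style folder marker keys (ending in '_$folder$')."""
    return [k for k in keys if k.endswith("_$folder$")]

def delete_folder_markers(bucket, prefix):
    """Delete any '_$folder$' marker objects left under the given prefix."""
    import boto3  # imported here so the pure helper above has no AWS dependency
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        markers = folder_marker_keys(obj["Key"] for obj in page.get("Contents", []))
        if markers:
            s3.delete_objects(
                Bucket=bucket,
                Delete={"Objects": [{"Key": k} for k in markers]},
            )
```

This could run as a final step of the job, after `job.commit()`, so the markers never reach downstream consumers.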
Hi. When I call DynamicFrame.toDF() in Glue, it effectively runs a "select * from table", which is a problem when the table is very large. How can I add a filter to the query so it doesn't read the whole table?
```
DataSource0 = glueContext.create_dynamic_frame.from_catalog(
    database=dataSourceCatalogDataBase,
    table_name=dataSourceCatalogTableName,
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="DataSource0")
df1 = DataSource0.toDF()
```
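If the catalog table is backed by partitioned data on S3, the documented `push_down_predicate` parameter of `create_dynamic_frame.from_catalog` prunes partitions before any data is read, so `toDF()` only sees the matching subset. A sketch, assuming a table partitioned by `year`/`month` (partition names are placeholders; for JDBC sources this parameter does not prune rows):

```python
def partition_predicate(year, month):
    """Build a push-down predicate string for a table partitioned by year/month."""
    return f"year = '{year}' AND month = '{month}'"

def read_partition(glue_context, database, table_name, predicate):
    # push_down_predicate is applied to the partition metadata before any
    # S3 objects are opened, so only matching partitions are scanned.
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        push_down_predicate=predicate,
        transformation_ctx="filtered_source",
    )
```

Usage would be `read_partition(glueContext, dataSourceCatalogDataBase, dataSourceCatalogTableName, partition_predicate("2023", "06")).toDF()`.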
Hello all, I have created the simple Python shell script below. The code runs fine from my local system, and I am also able to connect to the cluster from my local system. But when I run it as a Glue Python shell job, I get the following error:
```
import sys
import psycopg2

rds_host = "hostname"
name = "aaaaaaa"
password = "XXXXXXX"
db_name = "bbb"

conn = psycopg2.connect(host=rds_host, user=name, password=password, dbname=db_name)
with conn.cursor() as cur:
    query = "CALL test_vals()"
    cur.execute(query)
    conn.commit()
    cur.close()
```
CloudWatch error log:
```
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: could not connect to server: Connection timed out
    Is the server running on host "hostname" (XX.XXX.XX.XX) and accepting
    TCP/IP connections on port 5432?

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/runscript.py", line 215, in <module>
```
I have not added any Connections in the job properties. Please help.
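A connection timeout like this usually suggests the Python shell job has no network path to the RDS instance; attaching a Glue Connection (configured in the RDS VPC/subnet) under the job's properties is the usual fix, and the post notes no Connections were added. This can be done in the console, or, as a sketch, via boto3's `get_job`/`update_job` (which fields must be dropped before re-saving is an assumption here):

```python
def attach_connection(glue_client, job_name, connection_name):
    """Re-save a Glue job with a Connection attached, so it runs inside that
    connection's VPC and can reach the RDS instance."""
    job = glue_client.get_job(JobName=job_name)["Job"]
    job.pop("Name", None)                # JobUpdate does not accept Name
    job.pop("CreatedOn", None)           # read-only fields must be dropped
    job.pop("LastModifiedOn", None)
    job.pop("AllocatedCapacity", None)   # deprecated alias of MaxCapacity
    job["Connections"] = {"Connections": [connection_name]}
    glue_client.update_job(JobName=job_name, JobUpdate=job)
    return job
```

With the connection attached, the same psycopg2 code should be able to reach the host, provided the RDS security group allows inbound traffic on 5432 from the connection's subnet.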
I get strange errors in the Data Quality tab of Glue Studio, which seem to contradict the documentation at https://docs.aws.amazon.com/glue/latest/dg/dqdl.html#dqdl-rule-types-IsComplete:
```
Rule_13 IsComplete "abc" Rule failed Unsupported nested column type of column abc: ArrayType(DoubleType,true)!
```
This contradicts: "Supported column types: Any column type".
```
Rule_14 IsComplete "xyz" Rule failed Value: 0.6140889354693816 does not meet the constraint requirement!
```
This contradicts: "Checks whether all of the values in a column are complete (non-null)." Is there a better way to understand what is going wrong in my case?
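One possible reading of the second failure: only ~61.4% of the values in "xyz" are non-null, `IsComplete` requires every value to be non-null, and the error message is (confusingly) reporting the observed completeness ratio rather than a bad value. If partial completeness is actually acceptable, a `Completeness` rule with an explicit threshold might express the intent; a sketch, not verified against this dataset:

```
Rules = [
    Completeness "xyz" >= 0.6
]
```

The first failure, on the `ArrayType` column, looks like a genuine gap between the runtime and the "any column type" claim in the docs.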
Is there a way to invert rules in the Data Quality Definition Language (DQDL)? For example, I would like to make sure that a certain (error) column does not exist.
Hi! I have been trying to solve this error for a long time without success. I have already verified the existence of the database, the region, etc. Any help would be great. Thanks! ![Enter image description here](/media/postImages/original/IMYIujMbbqT-OCdcS5WfwBlA)
I have manually created a Lake Formation tag with key `classification` and value `non-pii`, and associated the tag with table columns. I now want a Glue job, using the Detect PII transform and custom code (the boto3 library), to overwrite that same Lake Formation tag key `classification` with the value `pii`. Please clarify: can a Glue job overwrite the tag this way based on the Detect PII results?
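As far as I know, assigning an LF-tag key that is already associated with a resource overwrites its value, so a Glue job could call the Lake Formation `add_lf_tags_to_resource` API from boto3 on the columns flagged by Detect PII. A sketch (database, table, and column names are placeholders):

```python
def lf_tag_update(database, table, columns, tag_key, tag_value):
    """Build the arguments for add_lf_tags_to_resource; re-assigning an
    already-associated tag key replaces its value on those columns."""
    return {
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,
            }
        },
        "LFTags": [{"TagKey": tag_key, "TagValues": [tag_value]}],
    }

def retag_pii_columns(lf_client, database, table, columns):
    # lf_client is boto3.client("lakeformation"); columns would come from
    # the Detect PII transform's output.
    lf_client.add_lf_tags_to_resource(
        **lf_tag_update(database, table, columns, "classification", "pii"))
```

The job's IAM role would also need Lake Formation permissions to associate tags on the table.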
How do I include PySpark libraries when running an AWS Glue Python job?
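A hedged note on what may be going on: a Glue *Python shell* job does not run on a Spark cluster, so `pyspark` is not available there. If the script needs PySpark, one option is to create the job as a Spark ETL job (`glueetl` command) instead, which ships with PySpark preinstalled. A CLI sketch (job name, role, and script path are placeholders):

```shell
aws glue create-job \
  --name my-pyspark-job \
  --role MyGlueServiceRole \
  --glue-version "4.0" \
  --command Name=glueetl,PythonVersion=3,ScriptLocation=s3://my-bucket/scripts/job.py
```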
I am trying to create standard Glue external tables with Terraform, replacing a number of Lake Formation governed tables. The original governed tables had no table-specific permissions granted and were dropped from the console. Any attempt to create a standard external table by any means (CLI, boto3, Glue console, Lake Formation console, Terraform) fails with an `AlreadyExistsException` on that table, yet when I fetch the table from the CLI or with boto3 it can't be found (EntityNotFound or similar). If, in either Terraform or the console, the table type is changed to governed, the table is created successfully (with all the same settings as before, i.e. region, path, classification, etc.). We would like to create these tables as standard external tables, but seem completely unable to, and have no idea whether this is a bug of some sort (likely in Lake Formation?) or whether we're missing something. Any help is appreciated.
We imported MIMIC IV data (which is already in **.ndjson format**) into a HealthLake data store and exported it, but we found that the import duplicated the **"lastupdated" key three times** in each record, and the export shows the same triplication. Querying in Athena then fails with the error below. **If there is a query to remove duplicate keys from a table row, please share it. Also, if anyone has found a solution to this error, please share.** Thanks in advance. **Query Id**: de44bccc-36af-488b-8c3d-bcf7e6d9360f ![Row is not a valid JSON Object - JSONException: Duplicate key "lastupdated"](/media/postImages/original/IM2sULF2gSRAyuOVZED3p3VQ)
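Since the source is ndjson, one option is a small preprocessing pass that collapses duplicate keys before the files reach Athena. Python's `json.loads` accepts an `object_pairs_hook` that sees every key/value pair, including duplicates; a sketch that keeps the first occurrence (whether first or last is correct for this data is an assumption to verify):

```python
import json

def keep_first(pairs):
    """object_pairs_hook that keeps the first value for any duplicated key."""
    out = {}
    for key, value in pairs:
        out.setdefault(key, value)
    return out

def dedupe_line(line):
    """Re-serialize one ndjson line with duplicate keys collapsed."""
    return json.dumps(json.loads(line, object_pairs_hook=keep_first))

# The duplicated "lastupdated" key collapses to its first value:
dedupe_line('{"id": 1, "lastupdated": "2023-01-01", "lastupdated": "2023-02-02"}')
# -> '{"id": 1, "lastupdated": "2023-01-01"}'
```

Mapping `dedupe_line` over each exported file and re-uploading the cleaned output would remove the `Duplicate key` failure at query time.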
Dear AWS re:Post, for my ETL jobs I read most of my data from RDS, but some I read directly from a table whose data sits on S3. I only discovered today that each job run generates a not insignificant cost from the GetObject API, and I'm trying to reconstruct how the calls work. I have approximately 60,000 files on S3 for this table, but I'm using a push-down predicate to read only about 6,000 of them in my ETL. I estimate the GetObject cost associated with my ETL at around 50,000,000 GetObject calls (the storage class is S3 Standard), i.e. $20 / ($0.0004 per 1,000 requests). As I only expect 5,000-6,000 files to be read, I assume create_dynamic_frame.from_catalog reads each S3 file using the partNumber option to cut the files into pieces. As the maximum number of parts is 10,000, that more or less fits my estimate. However, I couldn't find details in the documentation on how the S3 calls made by create_dynamic_frame.from_catalog work. Thanks a lot for your help!
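The arithmetic in the estimate checks out; a worked version (using the stated S3 Standard GET price of $0.0004 per 1,000 requests):

```python
# Back-of-the-envelope check of the request count implied by the bill.
price_per_1000_gets = 0.0004   # USD per 1,000 GET requests, S3 Standard
observed_cost_usd = 20.0

total_gets = observed_cost_usd / price_per_1000_gets * 1000   # ≈ 50,000,000 calls
gets_per_file = total_gets / 6_000                             # ≈ 8,333 GETs per file read
# ≈ 8,333 ranged requests per file is indeed under the 10,000-part
# maximum, consistent with the post's hypothesis.
```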
I have a Glue job for migrating data from a Postgres database in RDS to an S3 bucket in Parquet format, and a Crawler that connects to Postgres to infer the table schema. Previously, the Glue Connection used by both was configured to authenticate to Postgres via username and password; now I would like it to authenticate via credentials stored in Secrets Manager instead. After updating the Glue Connection to use Secrets Manager, the Glue job fails with the following error:
```
2022-12-14 14:58:07,092 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(73)): Error from Python:
Traceback (most recent call last):
  File "/tmp/parquet-job.py", line 25, in <module>
    database=glue_source_database, table_name=table, transformation_ctx="Datasource")
  File "/opt/amazon/lib/python3.6/site-packages/awsglue/dynamicframe.py", line 787, in from_catalog
    return self._glue_context.create_dynamic_frame_from_catalog(db, table_name, redshift_tmp_dir, transformation_ctx, push_down_predicate, additional_options, catalog_id, **kwargs)
  File "/opt/amazon/lib/python3.6/site-packages/awsglue/context.py", line 186, in create_dynamic_frame_from_catalog
    makeOptions(self._sc, additional_options), catalog_id),
  File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o75.getCatalogSource.
: java.util.NoSuchElementException: None.get
	at scala.None$.get(Option.scala:349)
	at scala.None$.get(Option.scala:347)
	at com.amazonaws.services.glue.util.DataCatalogWrapper.$anonfun$getJDBCConf$1(DataCatalogWrapper.scala:218)
	at scala.util.Try$.apply(Try.scala:209)
	at com.amazonaws.services.glue.util.DataCatalogWrapper.getJDBCConf(DataCatalogWrapper.scala:209)
	at com.amazonaws.services.glue.GlueContext.getGlueNativeJDBCSource(GlueContext.scala:487)
	at com.amazonaws.services.glue.GlueContext.getCatalogSource(GlueContext.scala:320)
	at com.amazonaws.services.glue.GlueContext.getCatalogSource(GlueContext.scala:185)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)
```
This is the code for the Glue job script:
```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME', 'TABLE_NAME',
                                     'PARQUET_LOCATION', 'PARQUET_TABLE_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

glue_source_database = "postgres-db"
table = args['TABLE_NAME'].replace("-", "_")
output_s3_path = args['PARQUET_LOCATION'] + "/" + args['PARQUET_TABLE_NAME']

Datasource = glueContext.create_dynamic_frame.from_catalog(
    database=glue_source_database,
    table_name=table,
    transformation_ctx="Datasource")  # the script fails at this line
print("Items Count: ", Datasource.count())
Datasource.printSchema()

Transform = ApplyMapping.apply(frame=Datasource,
                               mappings=[("id", "int", "id", "int"),
                                         ("name", "string", "name", "string")],
                               transformation_ctx="Transform")

DF = Transform.toDF()
formatted_path = output_s3_path.replace("_", "-")
DF.write.mode('overwrite').format('parquet').save(formatted_path)

job.commit()
```
I have confirmed that the IAM role for AWS Glue has permission to access my secret. The Crawler runs successfully using the same connection as the Glue job. I also tried reverting the connection to use username and password, and the job succeeded. Is there anything I am missing? Could this possibly be a bug in AWS Glue?
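While the `None.get` failure is investigated, one possible workaround is to bypass the connection's credential lookup: fetch the secret directly with boto3 and read via `create_dynamic_frame.from_options`. A sketch, assuming the secret follows the RDS JSON convention (`username`, `password`, `host`, `port`, `dbname` keys; the secret id is a placeholder):

```python
import json

def pg_connection_options(secret_string, table_name):
    """Build Glue connection options from an RDS-style secret JSON payload."""
    secret = json.loads(secret_string)
    url = "jdbc:postgresql://{host}:{port}/{dbname}".format(**secret)
    return {
        "url": url,
        "user": secret["username"],
        "password": secret["password"],
        "dbtable": table_name,
    }

def read_table(glue_context, secret_id, table_name):
    import boto3  # imported here so the pure helper above has no AWS dependency
    sm = boto3.client("secretsmanager")
    secret_string = sm.get_secret_value(SecretId=secret_id)["SecretString"]
    return glue_context.create_dynamic_frame.from_options(
        connection_type="postgresql",
        connection_options=pg_connection_options(secret_string, table_name),
    )
```

This keeps the secret out of job parameters while sidestepping the Glue Connection code path that raises `NoSuchElementException`.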