Error encountered while try to get user data - java.lang.NullPointerException


I created a Glue job that tries to read a single Parquet file (5.2 GB) into an AWS Glue DynamicFrame:

datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket-name/path"]},
    format="parquet"
)
  
and then applies some transformations to datasource0.
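
For reference, a minimal sketch (my assumption, not part of the original job) of the equivalent read with plain Spark, to check whether the failure is specific to the DynamicFrame API or also hits Spark's own Parquet reader:

# Sketch: read the same S3 path with plain Spark instead of a DynamicFrame.
# glueContext is the same object used above; the path is the placeholder
# from the snippet above, not a real key.
spark = glueContext.spark_session
df = spark.read.parquet("s3://my-bucket-name/path")
df.printSchema()
print(df.count())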

Job info:

  • Spark 2.4, Python 3, Glue 2.0
  • Worker type G.2X: 8 vCPUs, 32 GB memory

Errors from CloudWatch:

[1] NullPointerException

2020-11-13 00:27:56,873 ERROR [readingParquetFooters-ForkJoinPool-1-worker-13] util.UserData (UserData.java:getUserData(70)): Error encountered while try to get user data  
java.lang.NullPointerException  
	at com.amazon.ws.emr.hadoop.fs.shaded.com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:871)  
	at com.amazon.ws.emr.hadoop.fs.shaded.com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2726)  
	at com.amazon.ws.emr.hadoop.fs.util.UserData.getUserData(UserData.java:66)  
	at com.amazon.ws.emr.hadoop.fs.util.UserData.<init>(UserData.java:39)  
	at com.amazon.ws.emr.hadoop.fs.util.UserData.ofDefaultResourceLocations(UserData.java:52)  
	at com.amazon.ws.emr.hadoop.fs.util.AWSSessionCredentialsProviderFactory.buildSTSClient(AWSSessionCredentialsProviderFactory.java:52)  
	at com.amazon.ws.emr.hadoop.fs.util.AWSSessionCredentialsProviderFactory.<clinit>(AWSSessionCredentialsProviderFactory.java:17)  
	at com.amazon.ws.emr.hadoop.fs.rolemapping.DefaultS3CredentialsResolver.resolve(DefaultS3CredentialsResolver.java:22)  
	at com.amazon.ws.emr.hadoop.fs.guice.CredentialsProviderOverrider.override(CredentialsProviderOverrider.java:25)  
	at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.executeOverriders(GlobalS3Executor.java:171)  
	at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:103)  
	at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:189)  
	at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:184)  
	at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.getObjectMetadata(AmazonS3LiteClient.java:96)  
	at com.amazon.ws.emr.hadoop.fs.s3.lite.AbstractAmazonS3Lite.getObjectMetadata(AbstractAmazonS3Lite.java:43)  
	at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:220)  
	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:860)  
	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:1319)  
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:790)  
	at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:207)  
	at org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:65)  
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:498)  
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:476)  
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:544)  
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:538)  
	at org.apache.spark.util.ThreadUtils$$anonfun$3$$anonfun$apply$1.apply(ThreadUtils.scala:287)  
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)  
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)  
	at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)  
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)  
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)  
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)  
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)  

[2] IllegalArgumentException

2020-11-13 00:28:07,339 ERROR [Executor task launch worker for task 21] executor.Executor (Logging.scala:logError(91)): Exception in task 20.0 in stage 1.0 (TID 21)  
java.lang.IllegalArgumentException: Illegal Capacity: -168  
	at java.util.ArrayList.<init>(ArrayList.java:157)  
	at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1163)  
	at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)  
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)  
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)  
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)  
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)  
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)  
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)  
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)  
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)  
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)  
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)  
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)  
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)  
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)  
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)  
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)  
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)  
	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1817)  
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1168)  
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1168)  
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)  
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)  
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)  
	at org.apache.spark.scheduler.Task.run(Task.scala:121)  
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)  
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)  
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)  
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)  
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)  
	at java.lang.Thread.run(Thread.java:748)  
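
The negative capacity in [2] (allocating an ArrayList with size -168 inside ParquetFileReader) makes me suspect an int overflow while sizing a very large column chunk or row group. A minimal sketch to inspect the file's row-group sizes (my assumption, not part of the original post; pyarrow and s3fs are assumed available, and the object key is the placeholder path from above):

# Hedged diagnostic sketch: print row-group sizes of the Parquet file to see
# whether any single row group / column chunk is unusually large.
# Assumes pyarrow and s3fs are installed; the key below is the placeholder
# path from the question, not a real key.
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()
with fs.open("s3://my-bucket-name/path") as f:
    md = pq.ParquetFile(f).metadata
    print("row groups:", md.num_row_groups, "total rows:", md.num_rows)
    for i in range(md.num_row_groups):
        rg = md.row_group(i)
        print(f"row group {i}: {rg.num_rows} rows, {rg.total_byte_size} bytes")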

Any insights? Thanks in advance!


jugu
asked 4 years ago · 1,320 views
1 Answer

Hi, is there any update on this issue?

answered 3 years ago
