Please check whether the format below works for you. I have set other options in a similar way in Glue 3.0.
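from pyspark import SparkConf, SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Build the Spark configuration before the SparkContext is created.
sconf = SparkConf()
sconf.setAll([
    ('spark.executor.extraJavaOptions', '-Ddb2.jcc.charsetDecoderEncoder=3'),
    ('spark.driver.extraJavaOptions', '-Ddb2.jcc.charsetDecoderEncoder=3'),
])
sc = SparkContext(conf=sconf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)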
See the IBM Support article for more details on this CharConversionException error: https://www.ibm.com/support/pages/sqlexception-message-caught-javaiocharconversionexception-and-errorcode-4220
After setting the parameters
--conf 'spark.executor.extraJavaOptions=-Ddb2.jcc.charsetDecoderEncoder=3' --conf 'spark.driver.extraJavaOptions=-Ddb2.jcc.charsetDecoderEncoder=3'
an exception is no longer thrown when a non-UTF-8 character is encountered; instead, the character is substituted with its equivalent Unicode replacement character.
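To make the difference between the two behaviours concrete, here is a small analogy in plain Python. It uses Python's own UTF-8 decoder, not the DB2 JCC driver, so it only illustrates the idea of "raise on invalid bytes" versus "substitute the Unicode replacement character (U+FFFD)". The function names and the sample bytes are made up for the illustration.

```python
# Analogy only: Python's UTF-8 decoder stands in for the driver's decoder.
# 0xE9 is 'é' in Latin-1, but on its own it is not valid UTF-8.
raw = b"caf\xe9 ok"


def decode_strict(data: bytes) -> str:
    """Raise UnicodeDecodeError on invalid input (like the default behaviour)."""
    return data.decode("utf-8")


def decode_lenient(data: bytes) -> str:
    """Replace each invalid byte sequence with U+FFFD instead of raising."""
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    try:
        decode_strict(raw)
    except UnicodeDecodeError as exc:
        print("strict decode failed:", exc.reason)
    print(repr(decode_lenient(raw)))
```

The lenient path keeps the job running, at the cost of silently losing the original byte, so it is worth counting U+FFFD occurrences downstream if data quality matters.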
Hi @ananthtm, thanks for the reply. We have come across this solution before; the challenge was that we could only set these specific parameters on Glue 1.0. On both 2.0 and 3.0 we tried setting them in the configuration file, but they were ignored. I believe the VMs for Glue 2.0/3.0 may be kept in some kind of warm state rather than being started from a cold state, so there is no point at which these configuration parameters are read and applied to the Spark cluster.
If you know of a way to set these configuration values in Glue 3.0 (or, in future, 4.0), we would be able to solve multiple issues we have been having with illegal characters.
Lastly, the reason we are using 3.0 is that 1.0 is significantly slower to start up. Is there a possibility that these Spark configuration options can be set at the AWS account or job level?
We already have this in our code, but unfortunately it has no effect. We have been able to set 'spark.sql.adaptive.enabled' to false, and that worked, but the above code does not.
I believe the difference is that the Spark SQL setting acts on the already-running cluster, whereas JVM-level Java options need to be applied before the Spark cluster is actually instantiated. This is conjecture, though, as I do not know how Glue 2/3 initializes its cluster. All I know is that it starts up faster than Glue 1, which makes me think it is pulled from a "warm" image.
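The conjecture is at least consistent with the Spark configuration documentation, which says that in client mode `spark.driver.extraJavaOptions` must not be set through `SparkConf` in the application because the driver JVM has already started by then. Whether Glue 2/3 runs the driver this way is an assumption. A rough way to check is to compare what the Spark configuration reports with what the driver JVM actually received. The helper below is a hypothetical sketch (`find_jvm_property` is not part of Spark or Glue); the commented lines show how it might be used inside a job, and `sc._jvm` is a private PySpark attribute that could change between versions.

```python
import shlex
from typing import Optional


def find_jvm_property(java_options: str, name: str) -> Optional[str]:
    """Return the value of -D<name>=<value> in a Java options string.

    Returns "" if the property is present without a value, and None if it
    is absent. Hypothetical helper for diagnostics only.
    """
    prefix = f"-D{name}"
    for token in shlex.split(java_options or ""):
        if token == prefix:
            return ""
        if token.startswith(prefix + "="):
            return token[len(prefix) + 1:]
    return None


# Possible use inside a Glue job, assuming `sc` is the live SparkContext:
#
#   prop = "db2.jcc.charsetDecoderEncoder"
#   configured = find_jvm_property(
#       sc.getConf().get("spark.driver.extraJavaOptions", ""), prop)
#   actual = sc._jvm.java.lang.System.getProperty(prop)  # private API
#   print("configured:", configured, "actual in driver JVM:", actual)
#
# If `configured` is "3" but `actual` is None, the option was recorded in
# the Spark configuration but never reached the driver JVM.
```

This only inspects the driver; the executors would need a separate check, for example by looking at the executor logs for their launch command.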