glueContext handling java.io.CharConversionException with DB2 driver


We are receiving a java.io.CharConversionException when trying to read data from a DB2 database that contains characters outside the UTF-8 encoding. I tried adding an option("encoding", "ISO-8859-1") and an option("charset", "ISO-8859-1"), but neither seems to have any effect. Is it possible to ask the glueContext to use a specific encoding?

If not, what options do we have for handling the characters that throw the CharConversionException? We have been excluding the rows specifically through SQL, but this is not a tenable solution.
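For reference, our read looks roughly like this, using the Spark session from the GlueContext (a simplified sketch; the host, database, credentials, and table names are placeholders):

# Simplified sketch of the failing read; connection details are placeholders.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:db2://db2-host:50000/MYDB")
      .option("driver", "com.ibm.db2.jcc.DB2Driver")
      .option("dbtable", "MYSCHEMA.MYTABLE")
      .option("user", "db2user")
      .option("password", "****")
      .option("encoding", "ISO-8859-1")   # no effect
      .option("charset", "ISO-8859-1")    # no effect
      .load())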

Asked 1 year ago · Viewed 440 times
4 Answers

Please check if the format below works for you. I have set other options like this in Glue 3.0.

from pyspark import SparkConf, SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Set the DB2 JCC charset decoder option on the driver and executor JVMs.
# This must happen before the SparkContext is created.
sconf = SparkConf()
sconf.setAll([
    ('spark.executor.extraJavaOptions', '-Ddb2.jcc.charsetDecoderEncoder=3'),
    ('spark.driver.extraJavaOptions', '-Ddb2.jcc.charsetDecoderEncoder=3')
])

sc = SparkContext(conf=sconf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
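One way to sanity-check whether the values reached the context (note that seeing them in the conf does not prove the JVM flags were applied in time, since extraJavaOptions only take effect if set before the JVM starts):

# Read the effective configuration back from the running context.
print(sc.getConf().get('spark.driver.extraJavaOptions', 'not set'))
print(sc.getConf().get('spark.executor.extraJavaOptions', 'not set'))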
AWS
Answered 1 year ago
  • We already have this in our code, but unfortunately it has no effect. We were able to set 'spark.sql.adaptive.enabled' to false and that worked, but the code above does not.

    I believe the difference may be that the Spark SQL setting acts on the already-running cluster, whereas the JVM-level Java options need to be applied before the Spark cluster is actually instantiated. But this is conjecture, as I do not know how Glue 2/3 initializes its cluster. All I know is that it starts up faster than Glue 1, which makes me think it is pulled from a "warm" image.


See the IBM Support article for more details on this CharConversionException error: https://www.ibm.com/support/pages/sqlexception-message-caught-javaiocharconversionexception-and-errorcode-4220

After setting the parameters

--conf 'spark.executor.extraJavaOptions=-Ddb2.jcc.charsetDecoderEncoder=3' 
--conf 'spark.driver.extraJavaOptions=-Ddb2.jcc.charsetDecoderEncoder=3'

an exception will not be thrown when a non-UTF-8 character is encountered; instead, the character is substituted with its equivalent Unicode replacement character.
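In Glue 2.0/3.0 these are typically supplied as job parameters rather than in the script. A hedged sketch using boto3 (the job name is a placeholder, and the --conf job parameter is documented by AWS as internal, so this relies on behavior that could change):

import boto3

glue = boto3.client("glue")

# Job name is a placeholder. Chaining the second setting with an embedded
# "--conf" inside a single value is a commonly used workaround, since job
# parameter keys must be unique.
glue.start_job_run(
    JobName="my-db2-extract-job",
    Arguments={
        "--conf": "spark.executor.extraJavaOptions=-Ddb2.jcc.charsetDecoderEncoder=3 "
                  "--conf spark.driver.extraJavaOptions=-Ddb2.jcc.charsetDecoderEncoder=3"
    },
)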

AWS
Answered 1 year ago

Hi @ananthtm, thanks for the reply. We have come across this solution before; however, the challenge we found was that unless we used Glue 1.0 we were unable to set these specific parameters. We tried to set them in the configuration with both 2.0 and 3.0, but they were ignored. I believe the VMs for Glue 2.0/3.0 may be kept in some kind of warm state rather than loaded from a cold state, so there is no point at which these configuration parameters are read and applied to the Spark cluster.

If you know of a way to set these configuration values in our Glue 3.0 (or, in the future, 4.0) settings, it would let us solve multiple issues we have been having with illegal characters.

Lastly, the reason we are using 3.0 is that 1.0 is significantly slower to start up. Is there a possibility that these Spark configuration options can be configured at the AWS account or job level?

Answered 1 year ago

I just worked through this issue, and adding the following job parameter allowed the data to be read without any charset error:

--java-options -Ddb2.jcc.charsetDecoderEncoder=3

Thank you!

AWS
Aravind
Answered 2 months ago
