glueContext handling java.io.CharConversionException with DB2 driver


We are receiving a java.io.CharConversionException when trying to read data from a DB2 database that contains characters outside the UTF-8 encoding. I tried adding an option("encoding", "ISO-8859-1") and an option("charset", "ISO-8859-1"), but neither has any effect. Is it possible to ask the glueContext to use a specific encoding?

If not, what options do we have for handling the characters that throw the CharConversionException? We have been excluding the offending rows through SQL, but this is not a tenable solution.
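For context, the read looks roughly like the sketch below (the JDBC URL, table, and credentials are placeholders), with the encoding/charset options that appear to be ignored:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Plain Spark JDBC read against DB2; URL, table, and credentials are placeholders.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:db2://db2-host:50000/MYDB")
      .option("driver", "com.ibm.db2.jcc.DB2Driver")
      .option("dbtable", "MYSCHEMA.MYTABLE")
      .option("user", "db2user")
      .option("password", "db2password")
      # The two options we tried; neither changes the behaviour.
      .option("encoding", "ISO-8859-1")
      .option("charset", "ISO-8859-1")
      .load())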

Asked a year ago · 415 views
4 Answers

Please check if the format below works for you. I have set other options in a similar way in Glue 3.0:

from pyspark import SparkConf, SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Set the DB2 JCC decoder property on both the driver and the executors
# before the SparkContext is created.
sconf = SparkConf()
sconf.setAll([
    ('spark.executor.extraJavaOptions', '-Ddb2.jcc.charsetDecoderEncoder=3'),
    ('spark.driver.extraJavaOptions', '-Ddb2.jcc.charsetDecoderEncoder=3')
])

sc = SparkContext(conf=sconf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
Answered a year ago
  • We have this in our code already, but unfortunately it does not have any effect. We were able to set 'spark.sql.adaptive.enabled' to false and that worked, but the code above does not.

    I believe the difference may be that the Spark setting acts on the already-running cluster, whereas the Java-level JVM settings need to be applied before the Spark cluster is actually instantiated. But this is conjecture, as I do not know how Glue 2/3 initializes its cluster. All I know is that it starts up faster than Glue 1, which makes me think it is pulled from a "warm" image.


See the IBM Support article for more details on this CharConversionException error: https://www.ibm.com/support/pages/sqlexception-message-caught-javaiocharconversionexception-and-errorcode-4220

After setting the parameters

--conf 'spark.executor.extraJavaOptions=-Ddb2.jcc.charsetDecoderEncoder=3' 
--conf 'spark.driver.extraJavaOptions=-Ddb2.jcc.charsetDecoderEncoder=3'

an exception will not be thrown when a non-UTF-8 character is encountered; instead, the character is substituted with the Unicode replacement character.
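If setting SparkConf in the script is not picking the values up, one way these are commonly passed in Glue 2.0/3.0 is as a --conf job argument at run time, for example with boto3 (a sketch; the job name is a placeholder, and --conf is not an officially documented Glue job argument):

import boto3

glue = boto3.client("glue")

# Pass the JVM properties through the --conf job argument at run time.
# The job name below is a placeholder.
glue.start_job_run(
    JobName="my-db2-extract-job",
    Arguments={
        "--conf": "spark.executor.extraJavaOptions=-Ddb2.jcc.charsetDecoderEncoder=3 "
                  "--conf spark.driver.extraJavaOptions=-Ddb2.jcc.charsetDecoderEncoder=3"
    },
)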

Answered a year ago

Hi @ananthtm, thanks for the reply. We have previously come across this solution; however, the challenge we found was that unless we used Glue v1.0 we were unable to set these specific parameters. We tried with both 2.0 and 3.0 to set them in the configuration, but they were ignored. I believe it may be an issue where the VMs for Glue 2.0/3.0 are in some kind of warm state rather than being loaded from a cold state, so there is no point at which these configuration parameters are pulled in and applied to the Spark cluster.

If you know of a way to set these configuration values in our Glue 3.0 (or, in future, 4.0) settings, then we would be able to solve multiple issues we have been having with illegal characters.

Lastly, the reason we are using v3.0 is that v1.0 is significantly slower to start up. Is there a possibility that these Spark configuration options can be configured at the AWS account or job level?

Answered a year ago

I have just worked on this issue, and adding the following job parameter allowed the data to be read without any charset error:

--java-options -Ddb2.jcc.charsetDecoderEncoder=3
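If it helps, the same parameter can also be baked into the job definition as a default argument, for example with boto3 (a sketch; the job name, role, and script location are placeholders):

import boto3

glue = boto3.client("glue")

# Sketch: set --java-options as a job-level default argument.
# Name, role, and script location are placeholders.
glue.create_job(
    Name="my-db2-extract-job",
    Role="MyGlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/db2_extract.py",
        "PythonVersion": "3",
    },
    GlueVersion="3.0",
    DefaultArguments={"--java-options": "-Ddb2.jcc.charsetDecoderEncoder=3"},
)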

Thank you!

Aravind
Answered a month ago
