How to use SageMaker FastFile input mode with files that have Chinese characters in their names?

This post is both a bug report and a question. We are using SageMaker to train a model, and the setup is fairly standard. Since we have a lot of images, downloading them all at the start of training takes a very long time unless we switch input_mode to FastFile. After switching, I struggled to load the images inside the training container. Many samples in my dataset have Chinese characters in their file names, and while debugging the load failures I found that when SageMaker mounts the data from S3, it does not handle the encoding correctly. Here is an image name and the corresponding path inside the training container:

七年级上_第10章分式_七年级上_第10章分式_1077759_title_0-0_4_mathjax

/opt/ml/input/data/validation/\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_1077759_title_0-0_4_mathjax.png
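
For context, the training job is launched roughly like this (a minimal sketch; the image URI, role, instance type, bucket, and channel names are placeholders, not our exact setup):

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    image_uri='<training-image-uri>',       # placeholder
    role='<execution-role-arn>',            # placeholder
    instance_count=1,
    instance_type='ml.p3.2xlarge',          # placeholder
    input_mode='FastFile',                  # stream objects from S3 instead of downloading them up front
    sagemaker_session=sagemaker.Session(),
)

estimator.fit({
    'train': TrainingInput('s3://<bucket>/train/', input_mode='FastFile'),
    'validation': TrainingInput('s3://<bucket>/validation/', input_mode='FastFile'),
})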

This is not pretty, but I can still construct the correct path inside the container. The problem is that I cannot read the file even though the path exists:

os.path.exists('/opt/ml/input/data/validation/\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_1077759_title_0-0_4_mathjax.png') returns True, but cv2.imread('/opt/ml/input/data/validation/\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_1077759_title_0-0_4_mathjax.png') returns None. I then tried to open the file directly, which at least raises an error.

The code is with open('/opt/ml/input/data/validation/\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_1077759_title_0-0_4_mathjax.png', 'rb') as f: a = f.read() and it fails with

OSError: [Errno 107] Transport endpoint is not connected
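
Put together, the checks above look roughly like this as a runnable script (a minimal sketch using the same path shown earlier):

import os
import cv2

path = '/opt/ml/input/data/validation/\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_\u4E03\u5E74\u7EA7\u4E0A_\u7B2C10\u7AE0\u5206\u5F0F_1077759_title_0-0_4_mathjax.png'

print(os.path.exists(path))   # True
print(cv2.imread(path))       # None

with open(path, 'rb') as f:   # raises OSError: [Errno 107] Transport endpoint is not connected
    a = f.read()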

I tried loading a file in the same folder whose name contains no Chinese characters, and everything works fine in that case, so I am confident the Chinese characters in the file names are causing the problem. Is there a quick workaround so that I do not have to rename roughly 80% of the data in S3?
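
For reference, this is the comparison that does work (the ASCII-only file name below is a placeholder for a real file in the same folder):

import cv2

ok_path = '/opt/ml/input/data/validation/some_ascii_only_name.png'   # placeholder file name
img = cv2.imread(ok_path)
print(img is not None)   # True; files whose names contain no Chinese characters load fine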
