Installing s3fs causes errors in JupyterLab on SageMaker


On a freshly created JupyterLab Space instance (image: SageMaker Distribution 1.6) in SageMaker Studio on AWS, I open the terminal and run pip install s3fs, and it reports dependency errors during installation:

sagemaker-user@default:~$ pip install s3fs
Collecting s3fs
  Downloading s3fs-2024.3.1-py3-none-any.whl.metadata (1.6 kB)
Requirement already satisfied: aiobotocore<3.0.0,>=2.5.4 in /opt/conda/lib/python3.10/site-packages (from s3fs) (2.12.1)
Collecting fsspec==2024.3.1 (from s3fs)
  Downloading fsspec-2024.3.1-py3-none-any.whl.metadata (6.8 kB)
Requirement already satisfied: aiohttp!=4.0.0a0,!=4.0.0a1 in /opt/conda/lib/python3.10/site-packages (from s3fs) (3.9.3)
Requirement already satisfied: botocore<1.34.52,>=1.34.41 in /opt/conda/lib/python3.10/site-packages (from aiobotocore<3.0.0,>=2.5.4->s3fs) (1.34.51)
Requirement already satisfied: wrapt<2.0.0,>=1.10.10 in /opt/conda/lib/python3.10/site-packages (from aiobotocore<3.0.0,>=2.5.4->s3fs) (1.16.0)
Requirement already satisfied: aioitertools<1.0.0,>=0.5.1 in /opt/conda/lib/python3.10/site-packages (from aiobotocore<3.0.0,>=2.5.4->s3fs) (0.11.0)
Requirement already satisfied: aiosignal>=1.1.2 in /opt/conda/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->s3fs) (1.3.1)
Requirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->s3fs) (23.2.0)
Requirement already satisfied: frozenlist>=1.1.1 in /opt/conda/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->s3fs) (1.4.1)
Requirement already satisfied: multidict<7.0,>=4.5 in /opt/conda/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->s3fs) (6.0.5)
Requirement already satisfied: yarl<2.0,>=1.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->s3fs) (1.9.4)
Requirement already satisfied: async-timeout<5.0,>=4.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->s3fs) (4.0.3)
Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /opt/conda/lib/python3.10/site-packages (from botocore<1.34.52,>=1.34.41->aiobotocore<3.0.0,>=2.5.4->s3fs) (1.0.1)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /opt/conda/lib/python3.10/site-packages (from botocore<1.34.52,>=1.34.41->aiobotocore<3.0.0,>=2.5.4->s3fs) (2.9.0)
Requirement already satisfied: urllib3<2.1,>=1.25.4 in /opt/conda/lib/python3.10/site-packages (from botocore<1.34.52,>=1.34.41->aiobotocore<3.0.0,>=2.5.4->s3fs) (1.26.18)
Requirement already satisfied: idna>=2.0 in /opt/conda/lib/python3.10/site-packages (from yarl<2.0,>=1.0->aiohttp!=4.0.0a0,!=4.0.0a1->s3fs) (3.6)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.10/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.34.52,>=1.34.41->aiobotocore<3.0.0,>=2.5.4->s3fs) (1.16.0)
Downloading s3fs-2024.3.1-py3-none-any.whl (29 kB)
Downloading fsspec-2024.3.1-py3-none-any.whl (171 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 172.0/172.0 kB 18.4 MB/s eta 0:00:00
Installing collected packages: fsspec, s3fs
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2023.6.0
    Uninstalling fsspec-2023.6.0:
      Successfully uninstalled fsspec-2023.6.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
jupyter-ai 2.11.0 requires faiss-cpu, which is not installed.
datasets 2.18.0 requires fsspec[http]<=2024.2.0,>=2023.1.0, but you have fsspec 2024.3.1 which is incompatible.
jupyter-scheduler 2.5.1 requires fsspec==2023.6.0, but you have fsspec 2024.3.1 which is incompatible.
Successfully installed fsspec-2023.6.0 s3fs-2024.3.1
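The conflict pip reports can be reproduced from the version pins alone. A quick check with the packaging library (the same one that later fails in the traceback), using the specifiers exactly as they appear in the log:

```python
# Reproduce pip's conflict report from the version pins shown in the log.
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# datasets 2.18.0 requires fsspec[http]<=2024.2.0,>=2023.1.0
datasets_spec = SpecifierSet(">=2023.1.0,<=2024.2.0")
# jupyter-scheduler 2.5.1 requires fsspec==2023.6.0
scheduler_spec = SpecifierSet("==2023.6.0")

print(Version("2024.3.1") in datasets_spec)   # False: the fsspec that s3fs 2024.3.1 pulls in
print(Version("2023.6.0") in datasets_spec)   # True: the fsspec preinstalled in the image
print(Version("2023.6.0") in scheduler_spec)  # True
```

So the preinstalled fsspec 2023.6.0 satisfies both pins, while the 2024.3.1 that s3fs drags in satisfies neither.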

After installation, it is no longer possible to import transformers.Trainer or datasets:

TypeError                                 Traceback (most recent call last)
File /opt/conda/lib/python3.10/site-packages/transformers/utils/import_utils.py:1099, in _LazyModule._get_module(self, module_name)
   1098 try:
-> 1099     return importlib.import_module("." + module_name, self.__name__)
   1100 except Exception as e:

File /opt/conda/lib/python3.10/importlib/__init__.py:126, in import_module(name, package)
    125         level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)

File <frozen importlib._bootstrap>:1050, in _gcd_import(name, package, level)

File <frozen importlib._bootstrap>:1027, in _find_and_load(name, import_)

File <frozen importlib._bootstrap>:1006, in _find_and_load_unlocked(name, import_)

File <frozen importlib._bootstrap>:688, in _load_unlocked(spec)

File <frozen importlib._bootstrap_external>:883, in exec_module(self, module)

File <frozen importlib._bootstrap>:241, in _call_with_frames_removed(f, *args, **kwds)

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:162
    161 if is_datasets_available():
--> 162     import datasets
    164 if is_torch_tpu_available(check_device=False):

File /opt/conda/lib/python3.10/site-packages/datasets/__init__.py:18
     16 __version__ = "2.18.0"
---> 18 from .arrow_dataset import Dataset
     19 from .arrow_reader import ReadInstruction

File /opt/conda/lib/python3.10/site-packages/datasets/arrow_dataset.py:66
     64 from tqdm.contrib.concurrent import thread_map
---> 66 from . import config
     67 from .arrow_reader import ArrowReader

File /opt/conda/lib/python3.10/site-packages/datasets/config.py:41
     40 DILL_VERSION = version.parse(importlib.metadata.version("dill"))
---> 41 FSSPEC_VERSION = version.parse(importlib.metadata.version("fsspec"))
     42 PANDAS_VERSION = version.parse(importlib.metadata.version("pandas"))

File /opt/conda/lib/python3.10/site-packages/packaging/version.py:54, in parse(version)
     46 """Parse the given version string.
     47 
     48 >>> parse('1.0.dev1')
   (...)
     52 :raises InvalidVersion: When the version string is not a valid version.
     53 """
---> 54 return Version(version)

File /opt/conda/lib/python3.10/site-packages/packaging/version.py:198, in Version.__init__(self, version)
    197 # Validate the version and parse it into pieces
--> 198 match = self._regex.search(version)
    199 if not match:

TypeError: expected string or bytes-like object

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
Cell In[1], line 1
----> 1 from transformers import Trainer

File <frozen importlib._bootstrap>:1075, in _handle_fromlist(module, fromlist, import_, recursive)

File /opt/conda/lib/python3.10/site-packages/transformers/utils/import_utils.py:1089, in _LazyModule.__getattr__(self, name)
   1087     value = self._get_module(name)
   1088 elif name in self._class_to_module.keys():
-> 1089     module = self._get_module(self._class_to_module[name])
   1090     value = getattr(module, name)
   1091 else:

File /opt/conda/lib/python3.10/site-packages/transformers/utils/import_utils.py:1101, in _LazyModule._get_module(self, module_name)
   1099     return importlib.import_module("." + module_name, self.__name__)
   1100 except Exception as e:
-> 1101     raise RuntimeError(
   1102         f"Failed to import {self.__name__}.{module_name} because of the following error (look up to see its"
   1103         f" traceback):\n{e}"
   1104     ) from e

RuntimeError: Failed to import transformers.trainer because of the following error (look up to see its traceback):
expected string or bytes-like object

Because of this, I can't use datasets.load_from_disk or datasets.save_to_disk with an S3 bucket. I'm trying to create a Hugging Face Estimator training job from JupyterLab by first preprocessing the data, saving it to S3, and then calling huggingface_estimator.fit(), similar to this tutorial. Without s3fs, datasets.save_to_disk says it requires it, but after installing s3fs, I can no longer import datasets.

Modarn
asked 23 days ago · 164 views
1 Answer

SageMaker offers a number of input modes for accessing training data in S3.

File mode presents a file-system view of the dataset to the training container, which is similar to the functionality of s3fs.

Edit: if you are using a SageMaker Estimator, it supports reading from and writing to S3 buckets. Refer to the examples at Create and Run a Training Job.
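For example, a rough sketch of pointing an Estimator at S3 input (the role ARN, bucket, and framework versions below are placeholders, not tested):

```python
from sagemaker.huggingface import HuggingFace

# All names below are placeholders; substitute your own role, bucket, and versions.
estimator = HuggingFace(
    entry_point="train.py",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_type="ml.g5.xlarge",
    instance_count=1,
    transformers_version="4.36",
    pytorch_version="2.1",
    py_version="py310",
)

# Each channel key is mounted under /opt/ml/input/data/<channel> in the
# training container, and model artifacts are uploaded back to S3 afterwards.
estimator.fit({"train": "s3://amzn-s3-demo-bucket/preprocessed/train"})
```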

AWS
EXPERT
Mike_L
answered 23 days ago
  • Thank you, I will try that. Should I make a bug report somewhere about s3fs though?

  • s3fs is not an AWS product. You can check the issue tracker on their GitHub at https://github.com/fsspec/s3fs/issues

  • Oh OK, I actually already created an issue there before this post; I didn't know whether it would be better to raise it with AWS.

    Also, the input modes page you provided seems to be about how the training script called by the Estimator accesses the files, which is helpful. But I also need to save the preprocessed dataset to S3 directly from the notebook that creates the job, before the Estimator runs, and I don't see how to turn on "File mode" independently of an Estimator.

  • sagemaker.estimator.Estimator supports S3 buckets. I have updated my post.

  • Sorry, I don't think I was clear. I need to be able to pull data from S3 even without an Estimator or training job: just grab raw data into a notebook for processing.
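  • In that case, boto3 (preinstalled in the SageMaker Distribution image) can pull objects straight into the notebook with no Estimator involved. A minimal sketch, where the bucket and key are placeholders:

```python
import boto3

# Placeholders: substitute your own bucket name and object key.
s3 = boto3.client("s3")

# Read the object's contents into memory as bytes.
obj = s3.get_object(Bucket="amzn-s3-demo-bucket", Key="raw/data.csv")
raw_bytes = obj["Body"].read()

# Or download to local disk for libraries that expect a file path.
s3.download_file("amzn-s3-demo-bucket", "raw/data.csv", "/tmp/data.csv")
```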
