SageMaker JupyterLab kernel dies


I have created an ml.g4dn.2xlarge instance with 75 GB of storage. When I try to import torch from a notebook, the kernel dies.

import torch

If I launch Python from a terminal and import torch, I get no error.

My torch install:

sagemaker-user@default:~$ pip show torch
Name: torch
Version: 2.5.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3-Clause
Location: /opt/conda/lib/python3.11/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvjitlink-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, autogluon.multimodal, autogluon.timeseries, bitsandbytes, fastai, lightning, pytorch-lightning, pytorch-metric-learning, timm, torchmetrics, torchvision

Is there any way I can debug this issue?

asked a month ago · 82 views
2 Answers
Accepted Answer

I tried importing the packages one by one from the Python command line and found that this is the line causing Python to crash:

from transformers import AutoProcessor

Error message is:

ERROR: Flag 'minloglevel' was defined more than once (in files 'src/error.cc' and 'home/conda/feedstock_root/build_artifacts/abseil-split_1720857154496/work/absl/log/flags.cc').

This is discussed in this Stack Overflow question.

The issue has to do with an incompatible version of the sentencepiece package.

I downgraded the package:

pip install sentencepiece==0.1.99

Now, I'm able to import all packages.
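
For anyone hitting a similar hard crash, here is a rough sketch of that one-by-one check, run from a terminal on the instance. Each import runs in its own subprocess, so a segfault or abort in one library does not take down the whole session. The module list below is only an example; adjust it to your own stack.

import subprocess
import sys

# Candidate packages to test one at a time (example list).
candidates = ["torch", "torchvision", "transformers", "sentencepiece", "accelerate"]

for name in candidates:
    # Run each import in a separate interpreter so a hard crash is contained.
    result = subprocess.run(
        [sys.executable, "-c", f"import {name}"],
        capture_output=True,
        text=True,
    )
    status = "OK" if result.returncode == 0 else f"FAILED (exit code {result.returncode})"
    print(f"{name}: {status}")
    if result.returncode != 0 and result.stderr:
        print(result.stderr.strip())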

answered a month ago

The issue you're experiencing with the Jupyter kernel dying when importing torch in a notebook, while it works fine in a terminal, could be due to several factors. Here are some steps you can take to debug and potentially resolve the issue:

  1. Check kernel logs: Look for any error messages or stack traces in the Jupyter kernel logs. These logs might provide more detailed information about why the kernel is crashing (the sketch after this list shows one way to capture a traceback when the kernel dies hard).

  2. Verify memory usage: The ml.g4dn.2xlarge instance has 32 GB of RAM. Ensure that you're not running out of memory when importing torch. You can monitor memory usage using system tools or within your notebook (see the sketch after this list).

  3. Update PyTorch: Make sure you're using the latest compatible version of PyTorch. You can try updating it using pip:

    pip install --upgrade torch
    
  4. Check CUDA compatibility: Ensure that the CUDA version installed on your instance is compatible with your PyTorch version (a quick check is included in the sketch after this list).

  5. Restart the Jupyter kernel: Sometimes, simply restarting the kernel can resolve issues.

  6. Create a new conda environment: Try creating a fresh conda environment with only the necessary dependencies and see if the issue persists.

  7. Use a different kernel: Try using a different kernel, such as the Python 3 (Data Science) kernel, which comes pre-configured with many data science libraries.

  8. Increase instance size: If the issue persists, you might want to consider using a larger instance type with more memory and compute resources.

  9. Contact AWS Support: If none of the above steps resolve the issue, it might be worth reaching out to AWS Support for further assistance.
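
The sketch below pulls together items 1, 2, and 4. It is only a rough illustration: run it in a fresh notebook cell or a plain Python session on the instance, and note that it assumes psutil is available in the image (pip install psutil if it is not).

import faulthandler
faulthandler.enable()  # item 1: print a Python traceback if the process dies from a segfault or abort

import psutil  # item 2: rough memory headroom check
mem = psutil.virtual_memory()
print(f"RAM: {mem.available / 1024**3:.1f} GiB free of {mem.total / 1024**3:.1f} GiB")

import torch  # item 4: CUDA and driver compatibility as the installed PyTorch build sees it
print("torch version:      ", torch.__version__)
print("built against CUDA: ", torch.version.cuda)
print("CUDA available:     ", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:                ", torch.cuda.get_device_name(0))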

Remember that memory leaks can occur in Jupyter notebooks, especially when working with large datasets or complex models. It's a good practice to restart your kernel periodically and run garbage collection to free up memory.
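
As a small illustration of that cleanup (the tensor below is just a placeholder for whatever large object you no longer need):

import gc
import torch

big = torch.zeros(4096, 4096)  # placeholder for a large object you are done with
del big                        # drop the last reference
gc.collect()                   # reclaim Python-side memory
if torch.cuda.is_available():
    torch.cuda.empty_cache()   # return cached GPU memory to the driver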

If the problem continues, you might want to consider using SageMaker Studio instead of a standalone Jupyter notebook instance, as it offers more flexibility in terms of resource allocation and kernel management.
Sources
Set the Notebook Kernel - Amazon SageMaker
Machine Learning Service – Amazon SageMaker Studio Notebooks – AWS
Is there a memory leak with Sagemaker Jupyperlab python kernel? | AWS re:Post

answered a month ago
