Hello,
A SageMaker kernel can die because of high resource utilisation, or because of an issue within your code or a third-party library.
Please check system resource utilisation to confirm that your operations are running at appropriate load levels.
To check SageMaker notebook instance resources, enter the following commands in the Notebook terminal (a notebook-cell alternative is sketched after this list):
- To check memory utilisation: free -h
- To check CPU utilisation: top
- To check disk utilisation: df -h
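If you prefer to run the same checks from a notebook cell instead of the terminal, here is a minimal sketch. It assumes the psutil package is available (it ships with most SageMaker Python environments; otherwise install it with pip) and that the notebook's data volume is mounted at the default /home/ec2-user/SageMaker path.

```python
# Minimal sketch: the memory/CPU/disk checks above, run from a notebook cell.
# Assumes psutil is installed and the data volume is at /home/ec2-user/SageMaker.
import psutil

mem = psutil.virtual_memory()
print(f"Memory: {mem.available / 1e9:.1f} GB available of {mem.total / 1e9:.1f} GB")

print(f"CPU: {psutil.cpu_percent(interval=1):.0f}% in use")  # sampled over 1 second

disk = psutil.disk_usage("/home/ec2-user/SageMaker")  # the notebook's EBS volume
print(f"Disk: {disk.free / 1e9:.1f} GB free of {disk.total / 1e9:.1f} GB")
```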
If you see high CPU, memory, or disk utilisation, please try these solutions:
- Restart the notebook instance and try again.
- Review your SageMaker notebook instance type to verify that it's properly scoped and configured for your jobs.
=> If a resource crunch is observed, switch to a larger instance type and check whether the issue is resolved (a resizing sketch follows the links below).
[+] https://aws.amazon.com/premiumsupport/knowledge-center/sagemaker-troubleshoot-connectivity/
You may also refer to the following documentation to check CPU, memory, and disk utilisation: https://aws.amazon.com/premiumsupport/knowledge-center/open-sagemaker-jupyter-notebook/#:~:text=High%20CPU%20or%20memory%20utilization
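If you decide to switch to a larger instance type, the change can also be made programmatically. The following is a minimal sketch using boto3; the instance name "my-notebook" and the target type ml.t3.xlarge are placeholders, and because the instance must be stopped before it can be resized, run this from outside the notebook itself (for example, from your local machine with SageMaker permissions).

```python
# Minimal sketch: resize a notebook instance with boto3.
# "my-notebook" and "ml.t3.xlarge" are placeholders for your own values.
import boto3

sm = boto3.client("sagemaker")
name = "my-notebook"

# The instance type can only be changed while the instance is stopped.
sm.stop_notebook_instance(NotebookInstanceName=name)
sm.get_waiter("notebook_instance_stopped").wait(NotebookInstanceName=name)

sm.update_notebook_instance(NotebookInstanceName=name, InstanceType="ml.t3.xlarge")

sm.start_notebook_instance(NotebookInstanceName=name)
sm.get_waiter("notebook_instance_in_service").wait(NotebookInstanceName=name)
```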
Please also try running the code in smaller pieces, executing cells one by one if possible, to identify the issue at a more granular level and confirm that the code itself runs as expected.
Additionally, please try the following to fix the issue:
- Close the active sessions and clear the browser cache/cookies. When you have a large number of active sessions, the kernel might take longer to load in the browser.
- Open the SageMaker Notebook in a different browser. Check if the kernel connects successfully.
- Restart your notebook instance.
I hope this helps!
Usually a kernel will die for one of two reasons: 1) it runs out of memory, or 2) there is a bug in the code or a library. Try running this with a subset of your dataset and see if it runs to completion without error; this would eliminate the possibility of a bug. Then choose an instance type with more memory (ml.t3.xlarge has 16 GB of RAM) and see if that is enough memory for your dataset.
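For example, a quick way to test with a subset, assuming the data is a CSV loaded with pandas (the file name below is a hypothetical stand-in for whatever your notebook actually loads):

```python
# Minimal sketch: rerun the failing code on a small slice of the data first.
# "data.csv" is a hypothetical placeholder for your own data source.
import pandas as pd

df_sample = pd.read_csv("data.csv", nrows=50_000)  # load only the first 50k rows
df_sample.info(memory_usage="deep")                # per-column memory footprint to extrapolate from
# ...then run the rest of the notebook against df_sample instead of the full dataset.
```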