Has anyone been able to run the training steps (2.3 and 3.3) in the SageMaker JumpStart notebook "Introduction to SageMaker JumpStart - Text Generation with Falcon models"?

You can find the notebook by going to SageMaker Studio -> Home -> JumpStart -> Falcon 7B Instruct BF16 -> Notebook.

I did not change anything in the notebook. When the training starts, it errors out for me.

CloudWatch:

	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1769] 2023-08-15 22:24:01,348 >> ***** Running training *****
	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1770] 2023-08-15 22:24:01,348 >> Num examples = 1,054
	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1771] 2023-08-15 22:24:01,348 >> Num Epochs = 1
	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1772] 2023-08-15 22:24:01,348 >> Instantaneous batch size per device = 2
	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1773] 2023-08-15 22:24:01,348 >> Total train batch size (w. parallel, distributed & accumulation) = 16
	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1774] 2023-08-15 22:24:01,348 >> Gradient Accumulation steps = 2
	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1775] 2023-08-15 22:24:01,348 >> Total optimization steps = 66
	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1776] 2023-08-15 22:24:01,349 >> Number of trainable parameters = 6,921,720,704
	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1776] 2023-08-15 22:24:01,349 >> Number of trainable parameters = 6,921,720,704
	2023-08-15T18:24:02.357-04:00	0%| | 0/66 [00:00<?, ?it/s]
	2023-08-15T18:24:07.358-04:00	╭───────────────────── Traceback (most recent call last) ──────────────────────╮

Training job in sagemaker:

AlgorithmError: ExecuteUserScriptError: ExitCode 1 ErrorMessage "
	  154             raise RuntimeError(
	  155                 "none of output has requires_grad=True,"
	  156                 " this checkpoint() is not necessary")
	❱ 157         torch.autograd.backward(outputs_with_grad, args_with_grad)
	  158         grads = tuple(inp.grad if isinstance(inp, torch.Tensor) else N
	  159                     for inp in detached_inputs)
	  160
	/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py:200 in backward
	  197     # The reason we repeat same the comment below is that
, exit code: 1

The above is the output of section 3.3 in the notebook, but section 2.3 fails in the same way. As a workaround for step 2.3, I can train the model manually by going to SageMaker Studio -> Falcon 7B Instruct BF16 -> Train tab. However, I have no equivalent workaround for step 3.3, which always ends with the error above. I also tried changing the training parameters, without success.
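Since the ErrorMessage above is cut off, the full traceback should still be in the CloudWatch log streams for the training job. Below is a minimal sketch of how it could be pulled with boto3. It assumes the default log group `/aws/sagemaker/TrainingJobs` and log streams whose names start with the training job name. The function names `fetch_training_log_lines` and `lines_from_traceback` are hypothetical helpers of mine, not part of the notebook.

```python
def fetch_training_log_lines(job_name, client=None,
                             log_group="/aws/sagemaker/TrainingJobs"):
    """Return every CloudWatch log message for one SageMaker training job.

    Assumption: the job logs to the default SageMaker log group and its
    log streams are prefixed with the training job name.
    """
    if client is None:
        # Imported lazily so the helper can also be driven by a stub client.
        import boto3
        client = boto3.client("logs")

    messages = []
    kwargs = {"logGroupName": log_group, "logStreamNamePrefix": job_name}
    while True:
        page = client.filter_log_events(**kwargs)
        messages.extend(event["message"] for event in page.get("events", []))
        token = page.get("nextToken")
        if not token:
            break
        # Keep following the pagination token until the last page.
        kwargs["nextToken"] = token
    return messages


def lines_from_traceback(messages, marker="Traceback (most recent call last)"):
    """Return the messages from the first traceback marker onward."""
    for index, message in enumerate(messages):
        if marker in message:
            return messages[index:]
    return []
```

Usage would look like `print("\n".join(lines_from_traceback(fetch_training_log_lines("<your-training-job-name>"))))`, run with credentials that are allowed to read the log group. The last lines of that output should contain the actual exception that the truncated ErrorMessage does not show.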

Rafael
asked 8 months ago, 234 views
1 Answer
Hello,

I understand that you are trying to run the SageMaker JumpStart notebook "Introduction to SageMaker JumpStart - Text Generation with Falcon models", which you opened with the following steps:

[+] sagemaker studio -> home -> jumpstart -> Falcon 7B Instruct BF16 -> notebook

I replicated the scenario on my end and was able to run the SageMaker JumpStart notebook "Introduction to SageMaker JumpStart - Text Generation with Falcon models" successfully.

I followed the same steps as mentioned. Please retry on your end. If the issue persists, please reach out to AWS Support (SageMaker) with the details of your issue or use case, and we would be happy to assist you further.

I hope you find the above information helpful.

Thank you.

====Reference====
[+] Creating support cases and case management - https://docs.aws.amazon.com/awssupport/latest/user/case-management.html#creating-a-support-case

AWS
answered 8 months ago
