Has anyone been able to run the training steps (2.3 and 3.3) in the SageMaker JumpStart notebook "Introduction to SageMaker JumpStart - Text Generation with Falcon models"?


You can find the notebook by going to SageMaker Studio -> home -> JumpStart -> Falcon 7B Instruct BF16 -> notebook.

I did not change anything in the notebook. When the training starts, it errors out for me.

CloudWatch:

	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1769] 2023-08-15 22:24:01,348 >> ***** Running training *****
	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1770] 2023-08-15 22:24:01,348 >> Num examples = 1,054
	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1771] 2023-08-15 22:24:01,348 >> Num Epochs = 1
	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1772] 2023-08-15 22:24:01,348 >> Instantaneous batch size per device = 2
	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1773] 2023-08-15 22:24:01,348 >> Total train batch size (w. parallel, distributed & accumulation) = 16
	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1774] 2023-08-15 22:24:01,348 >> Gradient Accumulation steps = 2
	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1775] 2023-08-15 22:24:01,348 >> Total optimization steps = 66
	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1776] 2023-08-15 22:24:01,349 >> Number of trainable parameters = 6,921,720,704
	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1776] 2023-08-15 22:24:01,349 >> Number of trainable parameters = 6,921,720,704
	2023-08-15T18:24:02.357-04:00	0%| | 0/66 [00:00<?, ?it/s]
	2023-08-15T18:24:07.358-04:00	╭───────────────────── Traceback (most recent call last) ──────────────────────╮
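As a sanity check on the log above, the reported figures agree with each other. The sketch below only recomputes them: the device count is not printed in the log and is inferred here from the batch sizes, and rounding the step count up is an assumption that happens to reproduce the 66 steps shown.

```python
import math

# Values copied from the CloudWatch log above.
num_examples = 1054
num_epochs = 1
per_device_batch_size = 2
gradient_accumulation_steps = 2
total_train_batch_size = 16

# Assumption: the number of devices is not in the log; it is inferred
# from total = per_device * accumulation * devices.
num_devices = total_train_batch_size // (
    per_device_batch_size * gradient_accumulation_steps
)

# Assumption: a partial final batch still counts as one optimization step.
steps_per_epoch = math.ceil(num_examples / total_train_batch_size)
total_optimization_steps = steps_per_epoch * num_epochs

print(f"inferred devices: {num_devices}")
print(f"total optimization steps: {total_optimization_steps}")
```

So the job gets through setup with a consistent configuration and fails on the very first of the 66 steps, which matches the progress bar stopping at 0/66.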

Training job in SageMaker:

AlgorithmError: ExecuteUserScriptError: ExitCode 1 ErrorMessage "│ 154 │ │ │ raise RuntimeError( │ │ 155 │ │ │ │ "none of output has requires_grad=True," │ │ 156 │ │ │ │ " this checkpoint() is not necessary") │ │ ❱ 157 │ │ torch.autograd.backward(outputs_with_grad, args_with_grad) │ │ 158 │ │ grads = tuple(inp.grad if isinstance(inp, torch.Tensor) else N │ │ 159 │ │ │ │ │ for inp in detached_inputs) │ │ 160 │ │ │ │ /opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py:200 in │ │ backward │ │ 197 │ # The reason we repeat same the comment below is that , exit code: 1

The above is the output of section 3.3 in the notebook, but 2.3 has the same issue. I can train the model manually (instead of using step 2.3) by going to SageMaker Studio -> Falcon 7B Instruct BF16 -> train tab. However, I can't do that for step 3.3, which also results in the above issue. I also tried changing the training parameters, without much success.

Rafael
asked 9 months ago · 246 views
1 Answer

Hello,

I understand that you are trying to run the SageMaker JumpStart notebook "Introduction to SageMaker JumpStart - Text Generation with Falcon models" by following the steps below:

[+] SageMaker Studio -> home -> JumpStart -> Falcon 7B Instruct BF16 -> notebook

I replicated the scenario on my end and was able to run the SageMaker JumpStart notebook "Introduction to SageMaker JumpStart - Text Generation with Falcon models" successfully.

I followed the same steps as mentioned. Please retry on your end. If the issue persists, please reach out to AWS Support (SageMaker) with a detailed description of your issue or use case, and we would be happy to assist you further.

I hope you find the above information helpful.

Thank you.

====Reference==== [+] Creating support cases and case management - https://docs.aws.amazon.com/awssupport/latest/user/case-management.html#creating-a-support-case

AWS
answered 9 months ago
