Has anyone been able to run the training steps (2.3 and 3.3) in the SageMaker JumpStart notebook "Introduction to SageMaker JumpStart - Text Generation with Falcon models"?


You can find the notebook by going to SageMaker Studio -> Home -> JumpStart -> Falcon 7B Instruct BF16 -> Notebook.

I did not change anything in the notebook. When the training starts, it errors out for me.

CloudWatch:

	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1769] 2023-08-15 22:24:01,348 >> ***** Running training *****
	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1770] 2023-08-15 22:24:01,348 >> Num examples = 1,054
	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1771] 2023-08-15 22:24:01,348 >> Num Epochs = 1
	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1772] 2023-08-15 22:24:01,348 >> Instantaneous batch size per device = 2
	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1773] 2023-08-15 22:24:01,348 >> Total train batch size (w. parallel, distributed & accumulation) = 16
	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1774] 2023-08-15 22:24:01,348 >> Gradient Accumulation steps = 2
	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1775] 2023-08-15 22:24:01,348 >> Total optimization steps = 66
	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1776] 2023-08-15 22:24:01,349 >> Number of trainable parameters = 6,921,720,704
	2023-08-15T18:24:01.356-04:00	[INFO|trainer.py:1776] 2023-08-15 22:24:01,349 >> Number of trainable parameters = 6,921,720,704
	2023-08-15T18:24:02.357-04:00	0%| | 0/66 [00:00<?, ?it/s]
	2023-08-15T18:24:07.358-04:00	╭───────────────────── Traceback (most recent call last) ──────────────────────╮

Training job in SageMaker:

AlgorithmError: ExecuteUserScriptError: ExitCode 1 ErrorMessage "│ 154 │ │ │ raise RuntimeError( │ │ 155 │ │ │ │ "none of output has requires_grad=True," │ │ 156 │ │ │ │ " this checkpoint() is not necessary") │ │ ❱ 157 │ │ torch.autograd.backward(outputs_with_grad, args_with_grad) │ │ 158 │ │ grads = tuple(inp.grad if isinstance(inp, torch.Tensor) else N │ │ 159 │ │ │ │ │ for inp in detached_inputs) │ │ 160 │ │ │ │ /opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py:200 in │ │ backward │ │ 197 │ # The reason we repeat same the comment below is that , exit code: 1

The above is the output of section 3.3 in the notebook, but 2.3 fails in the same way. I can train the model manually (instead of using step 2.3) by going to SageMaker Studio -> Falcon 7B Instruct BF16 -> Train tab. However, I can't do that for step 3.3, which also results in the above error. I also tried changing the training parameters, without much success.

Rafael
asked 9 months ago, 245 views
1 Answer

Hello,

I understand that you are trying to run the SageMaker JumpStart notebook "Introduction to SageMaker JumpStart - Text Generation with Falcon models" by following the steps below.

[+] SageMaker Studio -> Home -> JumpStart -> Falcon 7B Instruct BF16 -> Notebook

I replicated the scenario on my end and was able to run the SageMaker JumpStart notebook "Introduction to SageMaker JumpStart - Text Generation with Falcon models" successfully.

I followed the same steps as mentioned. Please retry on your end; if the issue persists, please reach out to AWS Support (SageMaker) with the details of your issue or use case, and we would be happy to assist you further.
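If it helps when putting the case details together, here is a minimal sketch (not an official procedure) for pulling the job status and the recorded failure reason of a training job. It assumes `boto3` is installed and AWS credentials are configured; the helper name `summarize_training_failure` and the command-line usage are only illustrative. The only SageMaker call it makes is `describe_training_job`.

```python
import sys


def summarize_training_failure(sm_client, job_name):
    """Collect the status fields of a training job that are useful in a support case.

    `sm_client` is expected to behave like boto3.client("sagemaker").
    """
    desc = sm_client.describe_training_job(TrainingJobName=job_name)
    return {
        "name": desc["TrainingJobName"],
        "status": desc["TrainingJobStatus"],
        "secondary_status": desc.get("SecondaryStatus"),
        # FailureReason is only present when the job has failed.
        "failure_reason": desc.get("FailureReason", ""),
    }


if __name__ == "__main__" and len(sys.argv) == 2:
    # Illustrative usage: python summarize_failure.py <training-job-name>
    import boto3  # needs AWS credentials and a default region

    summary = summarize_training_failure(boto3.client("sagemaker"), sys.argv[1])
    for key, value in summary.items():
        print(f"{key}: {value}")
```

The failure reason shown for the job is typically a shortened form of the traceback, so the full traceback from the CloudWatch logs is also worth attaching to the case.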

I hope you find the above information helpful.

Thank you.

====Reference==== [+] Creating support cases and case management - https://docs.aws.amazon.com/awssupport/latest/user/case-management.html#creating-a-support-case

AWS
answered 9 months ago
