DeepSpeed error when fine-tuning Mistral 7B on JumpStart
I'm having trouble fine-tuning Mistral 7B on JumpStart. This is the error:
ErrorMessage "raise ValueError(
ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.
ERROR:root:Subprocess script failed with return code: 1
Traceback (most recent call last)
File "/opt/conda/lib/python3.10/site-packages/sagemaker_jumpstart_script_utilities/subprocess.py", line 9, in run_with_error_handling
subprocess.run(command, shell=shell, check=True)
File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError:
Command '['deepspeed', '--num_gpus=4', '/opt/conda/lib/python3.10/site-packages/sagemaker_jumpstart_huggingface_script_utilities/fine_tuning/run_clm.py', '--deepspeed', 'ds_config.json', '--model_name_or_path', '/tmp', '--train_file', '/opt/ml/input/data/training', '--do_train', '--output_dir', '/opt/ml/model', '--num_train_epochs', '1', '--gradient_accumulation_steps', '8', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '8', '--logging_steps', '8', '--warmup_ratio', '0.1', '--learning_rate', '6e-06', '--weight_decay', '0.2', '--seed', '10', '--max_input_length', '-1', '--validation_split_ratio', '0.2', '--train_data_split_seed', '0', '--max_steps', '-1', '--early_stopping_patience', '3', '--early_stopping_threshold', '0.0', '--adam_beta1', '0.9', '--adam_beta2', '0.999', '--max_grad_norm', '1.0', '--label_smoothing_factor', '0.0', '--logging_strategy', 'steps', '--save_strategy', 'steps', '--save_steps', '500', '--dataloader_num_workers', '0', '--lr_scheduler_type', 'constant_with_warmup', '--warmup_steps', '0', '--evaluation_strategy', 'steps', '--eval_steps', '20', '--lora_r', '8', '--lora_alpha', '16.0', '--lora_dropout', '0.05', '--bits', '16', '--quant_type', 'nf4', '--lora_finetuning', '--load_best_model_at_end', '--bf16', '--instruction_tuned', '--gradient_checkpointing', '--save_total_limit', '1', '--double_quant']' returned non-zero exit status 1.
During handling of the above exception, another exception occurred
File "/opt/ml/code/transfer_learning.py", line 68, in <module>
run_with_args(args)
File "/opt/ml/code/transfer_learning.py", line 42, in run_with_args
subprocess.run_with_error_handling(command)
File "/opt/conda/lib/python3.10/site-packages/sagemaker_jumpstart_script_utilities/subprocess.py", line 12, in run_with_error_handling
raise RuntimeError(e)
RuntimeError: Command '['deepspeed', '--num_gpus=4', '/opt/conda/lib/python3.10/site-packages/sagemaker_jumpstart_huggingface_script_utilities/fine_tuning/run_clm.py', '--deepspeed', 'ds_config.json', '--model_name_or_path', '/tmp', '--train_file', '/opt/ml/input/data/training', '--do_train', '--output_dir', '/opt/ml/model', '--num_train_epochs', '1', '--gradient_accumulation_steps', '8', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '8', '--logging_steps', '8', '--warmup_ratio', '0.1', '--learning_rate', '6e-06', '--weight_decay', '0.2', '--seed', '10', '--max_input_length', '-1', '--validation_split_ratio', '0.2', '--train_data_split_seed', '0', '--max_steps', '-1', '--early_stopping_patience', '3', '--early_stopping_threshold', '0.0', '--adam_beta1', '0.9', '--adam_beta2', '0.999', '--max_grad_norm', '1.0', '--label_smoothing_factor', '0.0', '--logging_strategy', 'steps', '--save_strategy', 'steps', '--save_steps', '500', '--dataloader_num_workers', '0', '--lr_scheduler_type', 'constant_with_warmup', '--warmup_steps', '0', '--evaluation_strategy', 'steps', '--eval_steps', '20', '--lora_r', '8', '--lora_alpha', '16.0', '--lora_dropout', '0.05', '--bits', '16', '--quant_type', 'nf4', '--lora_finetuning', '--load_best_model_at_end', '--bf16', '--instruction_tuned', '--gradient_checkpointing', '--save_total_limit', '1', '--double_quant']' returned non-zero exit status 1."
ErrorMessage "raise ValueError( ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`. ERROR:root:Subprocess script failed with return code: 1 Traceback (most recent call last) File "/opt/conda/lib/python3.10/site-packages/sagemaker_jumpstart_script_utilities/subprocess.py", line 9, in run_with_error_handling subprocess.run(command, shell=shell, check=True) File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run raise CalledProcessError(retcode, process.args, subprocess. CalledProcessError Command '['deepspeed', '--num_gpus=4', '/opt/conda/lib/python3.10/site-packages/sagemaker_jumpstart_huggingface_script_utilities/fine_tuning/run_clm.py', '--deepspeed', 'ds_config.json', '--model_name_or_path', '/tmp', '--train_file', '/opt/ml/input/data/training', '--do_train', '--output_dir', '/opt/ml/model', '--num_train_epochs', '1', '--gradient_accumulation_steps', '8', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '8', '--logging_steps', '8', '--warmup_ratio', '0.1', '--learning_rate', '6e-06', '--weight_decay', '0.2', '--seed', '10', '--max_input_length', '-1', '--validation_split_ratio', '0.2', '--train_data_split_seed', '0', '--max_steps', '-1', '--early_stopping_patience', '3', '--early_stopping_threshold', '0.0', '--adam_beta1', '0.9', '--adam_beta2', '0.999', '--max_grad_norm', '1.0', '--label_smoothing_factor', '0.0', '--logging_strategy', 'steps', '--save_strategy', 'steps', '--save_steps', '500', '--dataloader_num_workers', '0', '--lr_scheduler_type', 'constant_with_warmup', '--warmup_steps', '0', '--evaluation_strategy', 'steps', '--eval_steps', '20', '--lora_r', '8', '--lora_alpha', '16.0', '--lora_dropout', '0.05', '--bits', '16', '--quant_type', 'nf4', '--lora_finetuning', '--load_best_model_at_end', '--bf16', '--instruction_tuned', '--gradient_checkpointing', '--save_total_limit', '1', '--double_quant']' returned non-zero exit status 1. 
During handling of the above exception, another exception occurred File "/opt/ml/code/transfer_learning.py", line 68, in <module> run_with_args(args) File "/opt/ml/code/transfer_learning.py", line 42, in run_with_args subprocess.run_with_error_handling(command) File "/opt/conda/lib/python3.10/site-packages/sagemaker_jumpstart_script_utilities/subprocess.py", line 12, in run_with_error_handling raise RuntimeError(e) RuntimeError: Command '['deepspeed', '--num_gpus=4', '/opt/conda/lib/python3.10/site-packages/sagemaker_jumpstart_huggingface_script_utilities/fine_tuning/run_clm.py', '--deepspeed', 'ds_config.json', '--model_name_or_path', '/tmp', '--train_file', '/opt/ml/input/data/training', '--do_train', '--output_dir', '/opt/ml/model', '--num_train_epochs', '1', '--gradient_accumulation_steps', '8', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '8', '--logging_steps', '8', '--warmup_ratio', '0.1', '--learning_rate', '6e-06', '--weight_decay', '0.2', '--seed', '10', '--max_input_length', '-1', '--validation_split_ratio', '0.2', '--train_data_split_seed', '0', '--max_steps', '-1', '--early_stopping_patience', '3', '--early_stopping_threshold', '0.0', '--adam_beta1', '0.9', '--adam_beta2', '0.999', '--max_grad_norm', '1.0', '--label_smoothing_factor', '0.0', '--logging_strategy', 'steps', '--save_strategy', 'steps', '--save_steps', '500', '--dataloader_num_workers', '0', '--lr_scheduler_type', 'constant_with_warmup', '--warmup_steps', '0', '--evaluation_strategy', 'steps', '--eval_steps', '20', '--lora_r', '8', '--lora_alpha', '16.0', '--lora_dropout', '0.05', '--bits', '16', '--quant_type', 'nf4', '--lora_finetuning', '--load_best_model_at_end', '--bf16', '--instruction_tuned', '--gradient_checkpointing', '--save_total_limit', '1', '--double_quant']' returned non-zero exit status 1."
My understanding is that the problem is the `low_cpu_mem_usage=True` parameter or a `device_map` being passed; those options are not compatible with DeepSpeed ZeRO-3.
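To illustrate what I understand the error to mean, here is a minimal sketch of plain Transformers usage (not JumpStart's run_clm.py; the model name and kwargs are just my assumptions):

```python
# Minimal sketch of the conflict the error describes (plain Transformers usage,
# not JumpStart's run_clm.py; model name and kwargs are assumptions).
from transformers import AutoModelForCausalLM

USING_ZERO3 = True  # i.e. training with a DeepSpeed ZeRO stage-3 ds_config.json

if USING_ZERO3:
    # ZeRO-3 shards and places the weights itself, so neither kwarg may be passed:
    model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
else:
    # Outside ZeRO-3, these kwargs are fine and commonly used to save host RAM:
    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-v0.1",
        device_map="auto",
        low_cpu_mem_usage=True,
    )
```

The JumpStart training script builds that call itself, so I have no way to drop those kwargs from the UI.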
I'm fine-tuning through the SageMaker Studio UI on the newest Studio version. The model is the default model in the default artifact location.
For the datasets, I've tried several and hit the same error every time: the default dataset in the default training dataset location as well as my own custom datasets (English-only, Korean-only, and mixed).
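For context, my custom datasets roughly follow the JumpStart instruction-tuning layout (a train.jsonl plus a template.json), along these lines; the field names and the example record below are only illustrations, not my actual files:

```python
# Rough sketch of the instruction-tuning dataset layout I upload to S3
# (file/field names as I understand them from the JumpStart examples; the record is made up).
import json

record = {
    "instruction": "Summarize the following paragraph.",
    "context": "SageMaker JumpStart provides pretrained models ...",
    "response": "JumpStart offers ready-to-use pretrained models.",
}
with open("train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")

# template.json tells the training script how to assemble prompt/completion
# from the fields above.
template = {
    "prompt": "### Instruction:\n{instruction}\n\n### Context:\n{context}\n\n",
    "completion": "{response}",
}
with open("template.json", "w", encoding="utf-8") as f:
    json.dump(template, f, indent=2)
```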
These are my hyperparameters from the most recent attempt:
adam_beta1 0.9
adam_beta2 0.999
adam_epsilon 1e-8
auto_find_batch_size False
bf16 True
bits 16
dataloader_drop_last False
dataloader_num_workers 0
double_quant True
early_stopping_patience 3
early_stopping_threshold 0
epoch 1
eval_accumulation_steps None
eval_steps 20
evaluation_strategy steps
fp16 False
gradient_accumulation_steps 8
gradient_checkpointing True
instruction_tuned True
label_smoothing_factor 0
learning_rate 0.000006
load_best_model_at_end True
logging_first_step False
logging_nan_inf_filter True
logging_steps 8
lora_alpha 16
lora_dropout 0.05
lora_r 8
lr_scheduler_type constant_with_warmup
max_grad_norm 1
max_input_length -1
max_steps -1
max_train_samples -1
max_val_samples -1
peft_type lora
per_device_eval_batch_size 8
per_device_train_batch_size 2
preprocessing_num_workers None
quant_type nf4
sagemaker_container_log_level 20
sagemaker_job_name "jumpstart-dft-huggingface-llm-mistr-20231220-090005"
sagemaker_program "transfer_learning.py"
sagemaker_region "us-west-2"
sagemaker_submit_directory "/opt/ml/input/data/code/sourcedir.tar.gz"
save_steps 500
save_strategy steps
save_total_limit 1
seed 10
train_data_split_seed 0
train_from_scratch False
validation_split_ratio 0.2
warmup_ratio 0.1
warmup_steps 0
weight_decay 0.2
In all of my trials, I've only changed the following hyperparameters:
Peft Type: None -> lora
Lora R dimension: 64 -> 8
Lora Dropout: 0 -> 0.05
I've also tried keeping the default values for Lora R and Lora Dropout.
Everything else is set to the default, including the training instance (default = ml.g5.24xlarge). The thing is, I get the same error even when everything is left at the default except for {peft type: lora}.
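For reference, the SDK equivalent of what I'm configuring in the UI would look roughly like this (I'm actually launching from the Studio UI, so the model_id, EULA flag, and S3 path here are my assumptions):

```python
# Rough SDK equivalent of my UI configuration (a sketch, not what I ran;
# model_id, EULA handling, and the S3 URI are assumptions).
from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id="huggingface-llm-mistral-7b",   # assumed JumpStart model ID
    instance_type="ml.g5.24xlarge",          # the default training instance
    environment={"accept_eula": "true"},     # only if the model requires a EULA
    hyperparameters={
        "peft_type": "lora",
        "lora_r": "8",
        "lora_alpha": "16",
        "lora_dropout": "0.05",
        "epoch": "1",
        "learning_rate": "0.000006",
        "instruction_tuned": "True",
    },
)

# The "training" channel name matches /opt/ml/input/data/training in the log.
estimator.fit({"training": "s3://my-bucket/mistral-finetune/"})
```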
I can't figure out how to solve this, since the DeepSpeed parameter parsing isn't in my hands; I'm just using the UI. Fine-tuning other LLMs like Llama 2 works fine. Any clues would be appreciated.
Hi, I suggest using block quotes so your messages and config display nicely. Posting them as regular text makes them quite unreadable on our side.
@Didier_Durand thanks for mentioning it. I've fixed the post.
Just FYSA, adding a cross-reference to a similar issue we are experiencing, which has been logged in the AWS SageMaker Feedback repo on GitHub:
https://github.com/aws/amazon-sagemaker-feedback/issues/24
@R J Lewis thanks a lot!!