Training samples sized 1.8 GB in batches of 2, but EC2 p4d.24xlarge instance runs out of memory on the first step


As the title says, I am training a generative model on a p4d.24xlarge, which has 320 GB of GPU memory. Each batch is around 4 GB, but the instance errors out with an out-of-memory error (failure to allocate memory). The way I train is as follows:

  1. Initialize the instance.
  2. SSH in, generate a key, and clone the repo from GitLab.
  3. cd into the directory containing train.py and source the TensorFlow environment provided by the Amazon Linux 2 AMI with TensorFlow 2.16.
  4. pip install the two needed packages.
  5. Kick off training with `python3 train.py`. It then fails with the out-of-memory error. Since I am using a generator with batches of about 4 GB, I don't understand how that could consume 320 GB of memory on the first training step. The model has on the order of 10^6 parameters.

train code:

```python
import tensorflow as tf
from tensorflow import keras

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = unetpp_lite(input_layer_shape, num_classes, initial_filters,
                        dropout_rate, l1_reg, l2_reg, alpha=0.1, n_bridges=n_bridges)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr_schedule),
                  loss=tf.keras.losses.BinaryCrossentropy(),
                  metrics=['accuracy', csi, far, pod],
                  run_eagerly=True)

history = model.fit(train_dataset, validation_data=val_dataset, epochs=epochs,
                    steps_per_epoch=len(train_files) // batch_size, verbose=1,
                    callbacks=[checkpoint, tensorboard, csv_logger])
```
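
For context, here is roughly how a generator-backed `tf.data` pipeline with this batch size is typically assembled; the `load_sample` helper, the sample shapes, and the dtypes below are placeholders for illustration, not my actual loading code.

```python
import numpy as np
import tensorflow as tf

# Hypothetical generator: yields one (sample, label) pair per file.
# `load_sample`, the shapes, and the dtypes are assumptions; the real
# loading code lives elsewhere in train.py.
def sample_generator():
    for path in train_files:           # `train_files` is defined in train.py
        x, y = load_sample(path)       # hypothetical helper returning NumPy arrays
        yield x.astype(np.float32), y.astype(np.float32)

train_dataset = (
    tf.data.Dataset.from_generator(
        sample_generator,
        output_signature=(
            tf.TensorSpec(shape=(None, None, 4), dtype=tf.float32),  # assumed input shape
            tf.TensorSpec(shape=(None, None, 1), dtype=tf.float32),  # assumed label shape
        ),
    )
    .batch(2)        # two ~1.8 GB samples per batch, i.e. the ~4 GB described above
    .prefetch(1)     # note: each prefetched batch costs additional host memory
)
```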

Is there a certain setup that needs to be done before training? Do I need to install drivers? Additionally, how do you stop training? Can you use Ctrl+C? If I use kill PID, it doesn't seem to work; memory is still allocated to the GPUs when I run nvidia-smi.

I also get the warning below, but it doesn't appear to me to be problematic:

```
2024-09-02 02:58:38.959186: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-09-02 02:58:38.961433: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
```

2 Answers

First, consuming all 320 GB of GPU memory with a 4 GB batch is unusual; it suggests a problem in the model structure or the data processing pipeline rather than the batch size alone. Regarding pre-training setup: make sure the NVIDIA drivers and CUDA are correctly installed. Since nvidia-smi does return output on your instance, the drivers appear to be in place (the Deep Learning AMI ships with them preinstalled). Training can normally be stopped with Ctrl+C; if GPU memory is not released afterwards, make sure the Python process has fully exited (kill any lingering process) or restart the instance.
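
One note on the nvidia-smi readings: by default TensorFlow reserves most of each GPU's memory as soon as it initializes, so nvidia-smi showing the memory as allocated while a TensorFlow process is alive does not mean the model itself needs that much, and the memory is only returned to the driver when the process exits. A minimal sketch of switching to on-demand allocation instead (this only changes what nvidia-smi reports; it is not a fix for the OOM itself):

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of reserving nearly all of it at
# startup. Must be called before any operation touches the GPU.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```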

To address the memory issue itself, consider these approaches (a sketch of the mixed-precision option follows the list):

  1. Simplify the model structure or reduce parameter count.
  2. Decrease batch size further.
  3. Implement gradient accumulation.
  4. Use memory-efficient options like mixed precision training.
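
As an illustration of option 4, mixed precision in Keras is a one-line policy change; when the policy is active, `compile()` wraps the optimizer for loss scaling automatically. A minimal sketch, assuming the rest of train.py stays as in the question:

```python
import tensorflow as tf
from tensorflow import keras

# Run computations in float16 while keeping variables in float32.
# Set the policy before the model is constructed inside strategy.scope().
keras.mixed_precision.set_global_policy("mixed_float16")

# Sanity check that the policy is active.
print(keras.mixed_precision.global_policy())
```

With this policy it is also common practice to give the final activation layer an explicit `dtype="float32"` so the outputs fed to the loss stay in full precision.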

Lastly, the NUMA-related warning usually doesn't significantly impact performance and can be ignored.

If problems persist after trying these methods, closely review your model structure and data processing logic.

AWS
answered 4 months ago
EXPERT
reviewed 4 months ago

Thank you. If my dataset is 500 GB to 1 TB in total, what would be my best option for training on it quickly: combining multiple EC2 instances (e.g., several P5s) for distributed training, or simply lowering the batch size?

answered 4 months ago
