Neuron model loads when compiled for 1 core but fails to load when compiled for 4


Hello, we are testing pipeline mode for Neuron/Inferentia, but cannot get a model running on multiple cores. The single-core compiled model loads fine and runs inference on Inferentia without issue. However, after compiling the model for multi-core with compiler_args=['--neuroncore-pipeline-cores', '4'] (which takes ~16 hours on an r6a.16xlarge), the model errors out while loading into memory on the Inferentia box. Here's the error message (the compile call we used is sketched after the traceback):

2022-Nov-22 22:29:25.0728 20764:22801 ERROR  TDRV:dmem_alloc                              Failed to alloc DEVICE memory: 589824
2022-Nov-22 22:29:25.0728 20764:22801 ERROR  TDRV:copy_and_stage_mr_one_channel           Failed to allocate aligned (0) buffer in MLA DRAM for W10-t of size 589824 bytes, channel 0
2022-Nov-22 22:29:25.0728 20764:22801 ERROR  TDRV:kbl_model_add                           copy_and_stage_mr() error
2022-Nov-22 22:29:26.0091 20764:22799 ERROR  TDRV:dmem_alloc                              Failed to alloc DEVICE memory: 16777216
2022-Nov-22 22:29:26.0091 20764:22799 ERROR  TDRV:dma_ring_alloc                          Failed to allocate RX ring
2022-Nov-22 22:29:26.0091 20764:22799 ERROR  TDRV:drs_create_data_refill_rings            Failed to allocate pring for data refill dma
2022-Nov-22 22:29:26.0091 20764:22799 ERROR  TDRV:kbl_model_add                           create_data_refill_rings() error
2022-Nov-22 22:29:26.0116 20764:20764 ERROR  TDRV:remove_model                            Unknown model: 1001
2022-Nov-22 22:29:26.0116 20764:20764 ERROR  TDRV:kbl_model_remove                        Failed to find and remove model: 1001
2022-Nov-22 22:29:26.0117 20764:20764 ERROR  TDRV:remove_model                            Unknown model: 1001
2022-Nov-22 22:29:26.0117 20764:20764 ERROR  TDRV:kbl_model_remove                        Failed to find and remove model: 1001
2022-Nov-22 22:29:26.0117 20764:20764 ERROR  NMGR:dlr_kelf_stage                          Failed to load subgraph
2022-Nov-22 22:29:26.0354 20764:20764 ERROR  NMGR:stage_kelf_models                       Failed to stage graph: kelf-a.json to NeuronCore
2022-Nov-22 22:29:26.0364 20764:20764 ERROR  NMGR:kmgr_load_nn_post_metrics               Failed to load NN: 1.11.7.0+aec18907e-/tmp/tmpab7oth00, err: 4
Traceback (most recent call last):
  File "infer_test.py", line 34, in <module>
    model_neuron = torch.jit.load('model-4c.pt')
  File "/root/pytorch_venv/lib64/python3.7/site-packages/torch_neuron/jit_load_wrapper.py", line 13, in wrapper
    script_module = jit_load(*args, **kwargs)
  File "/root/pytorch_venv/lib64/python3.7/site-packages/torch/jit/_serialization.py", line 162, in load
    cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
RuntimeError: Could not load the model status=4 message=Allocation Failure
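
For reference, here is roughly how we compiled and saved the model (a minimal sketch; the actual network is a private CV model, so resnet50 and the 224x224 input below are just stand-ins for illustration):

# Sketch of the compile/save flow; resnet50 and the 1x3x224x224 input are
# placeholders for our private CV model and its real input shape.
import torch
import torch_neuron
import torchvision

model = torchvision.models.resnet50(pretrained=True).eval()
example = torch.rand(1, 3, 224, 224)

# Compile for a 4-core NeuronCore pipeline, same compiler args as above.
model_neuron = torch.neuron.trace(
    model,
    example_inputs=[example],
    compiler_args=['--neuroncore-pipeline-cores', '4'],
)
model_neuron.save('model-4c.pt')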

Any help would be appreciated.

asked a year ago · 373 views
1 Answer

Hi - can you confirm the type of Inf1 instance you are using for this, and whether you are using any container configuration? Also, how many cores have you assigned to the process? I want to ensure you have assigned sufficient cores to the process (as shown here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/nrt-configurable-parameters.html#nrt-configuration ).
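
As a minimal sketch, the runtime environment variables from that page could be set like this to give a single process all four cores of an inf1 device (they must be set before the runtime initializes, i.e. before the model is loaded):

# Sketch: assign NeuronCores 0-3 to this process before the Neuron runtime starts.
import os
os.environ['NEURON_RT_VISIBLE_CORES'] = '0-3'   # which cores this process may use
# alternatively: os.environ['NEURON_RT_NUM_CORES'] = '4'

import torch
import torch_neuron

# Loading the model triggers runtime initialization, so the variables above
# must already be set at this point.
model_neuron = torch.jit.load('model-4c.pt')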

AWS
answered a year ago
  • This is being run on an inf1.2xlarge box currently, just in a one-off dev config with the DL AMI. I have not changed the env vars; based on the documentation, it seems all cores are assigned to the process by default. For a 4-core inf1.2xlarge box, what values would make sense for visible_cores and num_cores with a 100% utilization target?

  • After testing with different values, changing visible_cores to 0-3 and num_cores to 4 (and everything in between) did not make any difference - the same error still occurs.

  • Would you be able to share more details on the model you are attempting to compile? It is very unusual to see a 16-hour compilation time, which may indicate that there are other issues occurring here even before executing the model.

    Could you potentially share which model is being used or a proxy model that has similar behavior?

    If this is a fully custom/private model, it could be helpful for us to look at a version of the model with the weights set to zero just to see if there are improvements we could make to the compilation process. If you can email steps/files/instructions for reproduction directly to aws-neuron-support@amazon.com then we can take a look.

  • After some trial and error, we came to the conclusion that Neuron doesn't like compiling/running CV models at higher resolutions. We ended up tiling our inputs, which appears to be working much better (rough sketch of what we mean below).
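
For anyone hitting the same wall, here is roughly what the tiling looks like (a minimal sketch; the 2048x2048 input and 512 tile size are illustrative, not our exact values):

# Split a high-resolution NCHW image into non-overlapping tiles that compile/run
# more reliably on Neuron. Assumes H and W are divisible by the tile size;
# real code would pad the input first.
import torch

def tile_nchw(x, tile=512):
    n, c, h, w = x.shape
    patches = x.unfold(2, tile, tile).unfold(3, tile, tile)  # N, C, H//t, W//t, t, t
    patches = patches.permute(0, 2, 3, 1, 4, 5).contiguous()
    return patches.view(-1, c, tile, tile)                   # one batch of tiles

image = torch.rand(1, 3, 2048, 2048)   # stand-in for a high-resolution input
tiles = tile_nchw(image)               # shape: (16, 3, 512, 512), fed to the Neuron model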
