Hello,
We are testing the pipeline mode for Neuron/Inferentia but cannot get a model running multi-core. The single-core compiled model loads fine and runs inference on Inferentia without issue. However, after compiling the model for multi-core with compiler_args=['--neuroncore-pipeline-cores', '4'] (which takes ~16 hrs on an r6a.16xl), the model errors out while loading into memory on the Inferentia box.
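For reference, the compile step looks roughly like this (a minimal sketch; the Sequential stand-in and the input shape are placeholders, not our actual CV model):

```python
import torch
import torch_neuron  # noqa: F401  (registers the torch.neuron API)

# Stand-in model and input shape; our real model and resolution differ.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
    torch.nn.ReLU(),
).eval()
example = torch.rand(1, 3, 1024, 1024)

# Compile once for a 4-NeuronCore pipeline; this is the step that takes ~16 hrs for us.
model_neuron = torch.neuron.trace(
    model,
    example_inputs=[example],
    compiler_args=['--neuroncore-pipeline-cores', '4'],
)
model_neuron.save('model-4c.pt')
```

Here's the error message: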
2022-Nov-22 22:29:25.0728 20764:22801 ERROR TDRV:dmem_alloc Failed to alloc DEVICE memory: 589824
2022-Nov-22 22:29:25.0728 20764:22801 ERROR TDRV:copy_and_stage_mr_one_channel Failed to allocate aligned (0) buffer in MLA DRAM for W10-t of size 589824 bytes, channel 0
2022-Nov-22 22:29:25.0728 20764:22801 ERROR TDRV:kbl_model_add copy_and_stage_mr() error
2022-Nov-22 22:29:26.0091 20764:22799 ERROR TDRV:dmem_alloc Failed to alloc DEVICE memory: 16777216
2022-Nov-22 22:29:26.0091 20764:22799 ERROR TDRV:dma_ring_alloc Failed to allocate RX ring
2022-Nov-22 22:29:26.0091 20764:22799 ERROR TDRV:drs_create_data_refill_rings Failed to allocate pring for data refill dma
2022-Nov-22 22:29:26.0091 20764:22799 ERROR TDRV:kbl_model_add create_data_refill_rings() error
2022-Nov-22 22:29:26.0116 20764:20764 ERROR TDRV:remove_model Unknown model: 1001
2022-Nov-22 22:29:26.0116 20764:20764 ERROR TDRV:kbl_model_remove Failed to find and remove model: 1001
2022-Nov-22 22:29:26.0117 20764:20764 ERROR TDRV:remove_model Unknown model: 1001
2022-Nov-22 22:29:26.0117 20764:20764 ERROR TDRV:kbl_model_remove Failed to find and remove model: 1001
2022-Nov-22 22:29:26.0117 20764:20764 ERROR NMGR:dlr_kelf_stage Failed to load subgraph
2022-Nov-22 22:29:26.0354 20764:20764 ERROR NMGR:stage_kelf_models Failed to stage graph: kelf-a.json to NeuronCore
2022-Nov-22 22:29:26.0364 20764:20764 ERROR NMGR:kmgr_load_nn_post_metrics Failed to load NN: 1.11.7.0+aec18907e-/tmp/tmpab7oth00, err: 4
Traceback (most recent call last):
  File "infer_test.py", line 34, in <module>
    model_neuron = torch.jit.load('model-4c.pt')
  File "/root/pytorch_venv/lib64/python3.7/site-packages/torch_neuron/jit_load_wrapper.py", line 13, in wrapper
    script_module = jit_load(*args, **kwargs)
  File "/root/pytorch_venv/lib64/python3.7/site-packages/torch/jit/_serialization.py", line 162, in load
    cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
RuntimeError: Could not load the model status=4 message=Allocation Failure
Any help would be appreciated.
This is currently being run on an inf1.2xl box, just in a one-off dev config with the DL AMI. I have not changed the env vars; based on the documentation, it seems that by default all cores are assigned to the process. For a 4-core inf1.2xl box, what values would make sense for visible_cores and num_cores with a 100% utilization target?
After testing with different values, changing visible_cores to 0-3 and num_cores to 4 (and everything in between) did not make any difference; the same error still occurs.
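For reference, each attempt looked roughly like this (a sketch; I'm assuming visible_cores/num_cores correspond to the NEURON_RT_VISIBLE_CORES and NEURON_RT_NUM_CORES variables from the runtime docs, set before the process loads the model):

```python
import os

# Assumption: these are the runtime env vars behind "visible_cores" and
# "num_cores"; normally only one of the two should be needed.
os.environ['NEURON_RT_VISIBLE_CORES'] = '0-3'  # expose NeuronCores 0 through 3
os.environ['NEURON_RT_NUM_CORES'] = '4'        # or: request 4 cores by count

import torch
import torch_neuron  # noqa: F401  (registers the Neuron backend)

model_neuron = torch.jit.load('model-4c.pt')  # still fails with status=4
```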
Would you be able to share more details on the model you are attempting to compile? It is very unusual to see a 16-hour compilation time, which may indicate that there are other issues occurring here even before the model is executed.
Could you potentially share which model is being used or a proxy model that has similar behavior?
If this is a fully custom/private model, it could be helpful for us to look at a version of the model with the weights set to zero, just to see if there are improvements we could make to the compilation process. If you can email steps/files/instructions for reproduction directly to aws-neuron-support@amazon.com, we can take a look.
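For example, something along these lines would strip the trained weights while keeping the architecture (a sketch; the Conv2d stand-in is hypothetical, substitute your actual model):

```python
import torch

model = torch.nn.Conv2d(3, 8, kernel_size=3)  # stand-in; use your actual model

# Zero every parameter in place so the architecture can be shared
# without revealing the trained weights.
with torch.no_grad():
    for p in model.parameters():
        p.zero_()

torch.save(model.state_dict(), 'model_zeroed.pt')
```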
After some trial and error, we came to the conclusion that Neuron doesn't like to compile/run CV models at higher resolutions. We ended up tiling our inputs, which appears to be working much better.
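In case it helps anyone else hitting the same wall, the tiling is along these lines (a sketch; the 512-pixel tile size and the non-overlapping split are illustrative, not our exact settings):

```python
import torch

def tile_image(x: torch.Tensor, tile: int = 512) -> torch.Tensor:
    """Split a (1, C, H, W) image into non-overlapping (N, C, tile, tile) tiles.

    Assumes H and W are exact multiples of `tile`; pad beforehand otherwise.
    """
    _, _, h, w = x.shape
    tiles = [
        x[:, :, i:i + tile, j:j + tile]
        for i in range(0, h, tile)
        for j in range(0, w, tile)
    ]
    return torch.cat(tiles, dim=0)

# Usage: compile the model for the (1, C, tile, tile) input shape, then run
# each tile through it and stitch the outputs back together, e.g.:
#   image = load_image()  # hypothetical loader returning (1, 3, 2048, 2048)
#   outs = [model_neuron(t.unsqueeze(0)) for t in tile_image(image)]
```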