Hello,
We are testing the pipeline mode for Neuron/Inferentia but cannot get a model running multi-core. The single-core compiled model loads fine and runs inference on Inferentia without issue. However, after compiling the model for multi-core with compiler_args=['--neuroncore-pipeline-cores', '4'] (which takes ~16 hrs on an r6a.16xl), the model errors out while loading into memory on the Inferentia box.
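For reference, the compile step looks roughly like this (a minimal sketch; the Sequential stand-in and the input shape are placeholders, not our actual CV model):

```python
import torch
import torch_neuron  # noqa: F401  (registers the torch.neuron API)

# Stand-in model and input shape; our real model and resolution differ.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
    torch.nn.ReLU(),
).eval()
example = torch.rand(1, 3, 1024, 1024)

# Compile once for a 4-NeuronCore pipeline; this is the step that takes ~16 hrs for us.
model_neuron = torch.neuron.trace(
    model,
    example_inputs=[example],
    compiler_args=['--neuroncore-pipeline-cores', '4'],
)
model_neuron.save('model-4c.pt')
```

Here's the error message: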
2022-Nov-22 22:29:25.0728 20764:22801 ERROR TDRV:dmem_alloc Failed to alloc DEVICE memory: 589824
2022-Nov-22 22:29:25.0728 20764:22801 ERROR TDRV:copy_and_stage_mr_one_channel Failed to allocate aligned (0) buffer in MLA DRAM for W10-t of size 589824 bytes, channel 0
2022-Nov-22 22:29:25.0728 20764:22801 ERROR TDRV:kbl_model_add copy_and_stage_mr() error
2022-Nov-22 22:29:26.0091 20764:22799 ERROR TDRV:dmem_alloc Failed to alloc DEVICE memory: 16777216
2022-Nov-22 22:29:26.0091 20764:22799 ERROR TDRV:dma_ring_alloc Failed to allocate RX ring
2022-Nov-22 22:29:26.0091 20764:22799 ERROR TDRV:drs_create_data_refill_rings Failed to allocate pring for data refill dma
2022-Nov-22 22:29:26.0091 20764:22799 ERROR TDRV:kbl_model_add create_data_refill_rings() error
2022-Nov-22 22:29:26.0116 20764:20764 ERROR TDRV:remove_model Unknown model: 1001
2022-Nov-22 22:29:26.0116 20764:20764 ERROR TDRV:kbl_model_remove Failed to find and remove model: 1001
2022-Nov-22 22:29:26.0117 20764:20764 ERROR TDRV:remove_model Unknown model: 1001
2022-Nov-22 22:29:26.0117 20764:20764 ERROR TDRV:kbl_model_remove Failed to find and remove model: 1001
2022-Nov-22 22:29:26.0117 20764:20764 ERROR NMGR:dlr_kelf_stage Failed to load subgraph
2022-Nov-22 22:29:26.0354 20764:20764 ERROR NMGR:stage_kelf_models Failed to stage graph: kelf-a.json to NeuronCore
2022-Nov-22 22:29:26.0364 20764:20764 ERROR NMGR:kmgr_load_nn_post_metrics Failed to load NN: 1.11.7.0+aec18907e-/tmp/tmpab7oth00, err: 4
Traceback (most recent call last):
  File "infer_test.py", line 34, in <module>
    model_neuron = torch.jit.load('model-4c.pt')
  File "/root/pytorch_venv/lib64/python3.7/site-packages/torch_neuron/jit_load_wrapper.py", line 13, in wrapper
    script_module = jit_load(*args, **kwargs)
  File "/root/pytorch_venv/lib64/python3.7/site-packages/torch/jit/_serialization.py", line 162, in load
    cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
RuntimeError: Could not load the model status=4 message=Allocation Failure
Any help would be appreciated.
This is currently being run on an inf1.2xl box, just in a one-off dev config with the DL AMI. I have not changed the env vars; based on the documentation, it seems that by default all cores are assigned to the process. For a 4-core inf1.2xl box, what values would make sense for visible_cores and num_cores with a 100% utilization target?
After testing with different values, changing visible_cores to 0-3 and num_cores to 4 (and everything in between) did not make any difference; the same error still occurs.
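For reference, each attempt looked roughly like this (a sketch; I'm assuming visible_cores/num_cores correspond to the NEURON_RT_VISIBLE_CORES and NEURON_RT_NUM_CORES variables from the runtime docs, set before the process loads the model):

```python
import os

# Assumption: these are the runtime env vars behind "visible_cores" and
# "num_cores"; normally only one of the two should be needed.
os.environ['NEURON_RT_VISIBLE_CORES'] = '0-3'  # expose NeuronCores 0 through 3
os.environ['NEURON_RT_NUM_CORES'] = '4'        # or: request 4 cores by count

import torch
import torch_neuron  # noqa: F401  (registers the Neuron backend)

model_neuron = torch.jit.load('model-4c.pt')  # still fails with status=4
```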
Would you be able to share more details on the model you are attempting to compile? It is very unusual to see a 16-hour compilation time, which may indicate that there are other issues occurring here even before the model is executed.
Could you potentially share which model is being used or a proxy model that has similar behavior?
If this is a fully custom/private model, it could be helpful for us to look at a version of the model with the weights set to zero, just to see if there are improvements we could make to the compilation process. If you can email steps/files/instructions for reproduction directly to aws-neuron-support@amazon.com, we can take a look.
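For example, something along these lines would strip the trained weights while keeping the architecture (a sketch; the Conv2d stand-in is hypothetical, substitute your actual model):

```python
import torch

model = torch.nn.Conv2d(3, 8, kernel_size=3)  # stand-in; use your actual model

# Zero every parameter in place so the architecture can be shared
# without revealing the trained weights.
with torch.no_grad():
    for p in model.parameters():
        p.zero_()

torch.save(model.state_dict(), 'model_zeroed.pt')
```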
After some trial and error, we came to the conclusion that Neuron doesn't like to compile/run CV models at higher resolutions. We ended up tiling our inputs, which appears to be working much better.
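In case it helps anyone else hitting the same wall, the tiling is along these lines (a sketch; the 512-pixel tile size and the non-overlapping split are illustrative, not our exact settings):

```python
import torch

def tile_image(x: torch.Tensor, tile: int = 512) -> torch.Tensor:
    """Split a (1, C, H, W) image into non-overlapping (N, C, tile, tile) tiles.

    Assumes H and W are exact multiples of `tile`; pad beforehand otherwise.
    """
    _, _, h, w = x.shape
    tiles = [
        x[:, :, i:i + tile, j:j + tile]
        for i in range(0, h, tile)
        for j in range(0, w, tile)
    ]
    return torch.cat(tiles, dim=0)

# Usage: compile the model for the (1, C, tile, tile) input shape, then run
# each tile through it and stitch the outputs back together, e.g.:
#   image = load_image()  # hypothetical loader returning (1, 3, 2048, 2048)
#   outs = [model_neuron(t.unsqueeze(0)) for t in tile_image(image)]
```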