Questions tagged with AWS Inferentia
Hello,
What is the difference between AWS Trainium, AWS Inferentia, and instances with Habana accelerators?
Thanks,
I converted a PyTorch BERT model to Neuron. However, the output (embedding) tensors, each a list of size 1024, are different: the list sizes are the same, but the individual entries differ, each by around 1-5% from the original PyTorch model's output.
This is the code I use to Neuron-compile the PyTorch model.
```
from transformers import BertTokenizer, BertModel
import torch
import torch_neuron

PATH = './ptmodel/'   # local directory holding the pretrained model
modelname = PATH      # tokenizer is loaded from the same directory
fname = 'modelneuron.pt'

tokenizer = BertTokenizer.from_pretrained(modelname, model_max_length=512)
input_str = "The patient's ability is determined based on patients medical parameters, patients history of ability to attend a remote clinician sessions, and physical parameters. Based on identified parameters a patients profile score is calculated to determine patients ability to attend the remote clinician session."
inputs = tokenizer(input_str, padding='max_length', return_tensors="pt")

model = BertModel.from_pretrained(PATH, local_files_only=True, return_dict=False)

# '--fast-math none' turns off fast-math optimizations (keeps FP32 numerics)
kwargs = {'compiler_args': ['--fast-math', 'none', '--neuroncore-pipeline-cores', '1']}
neuron_model = torch_neuron.trace(
    model,
    example_inputs=(inputs['input_ids'], inputs['attention_mask'], inputs['token_type_ids']),
    **kwargs)
neuron_model.save(fname)
```
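For reference, here is a quick check (a sketch reusing `model`, `neuron_model`, and `inputs` from the block above) to quantify how far the Neuron outputs drift from the PyTorch ones:
```
import torch

example = (inputs['input_ids'], inputs['attention_mask'], inputs['token_type_ids'])

# Reference output from the original PyTorch model...
with torch.no_grad():
    ref = model(*example)[0]
# ...and from the Neuron-compiled model.
neuron_out = neuron_model(*example)[0]

# Maximum relative deviation between the two outputs.
rel_err = ((neuron_out - ref).abs() / ref.abs().clamp(min=1e-6)).max()
print(f"max relative error: {rel_err:.4f}")
print("within ~5%:", torch.allclose(ref, neuron_out, rtol=5e-2, atol=1e-3))
```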
Has anyone faced this issue, or does anyone know how to solve it?
Thanks
Ajay
Hello,
We are testing the pipeline mode for Neuron/Inferentia but cannot get a model running multi-core. The single-core compiled model loads fine and runs inference on Inferentia without issue. However, after compiling a model for multi-core using `compiler-args=['--neuroncore-pipeline-cores', '4']` (which takes ~16 hrs on an r6a.16xl), the model errors out while loading into memory on the Inferentia box. Here's the error message:
```
2022-Nov-22 22:29:25.0728 20764:22801 ERROR TDRV:dmem_alloc Failed to alloc DEVICE memory: 589824
2022-Nov-22 22:29:25.0728 20764:22801 ERROR TDRV:copy_and_stage_mr_one_channel Failed to allocate aligned (0) buffer in MLA DRAM for W10-t of size 589824 bytes, channel 0
2022-Nov-22 22:29:25.0728 20764:22801 ERROR TDRV:kbl_model_add copy_and_stage_mr() error
2022-Nov-22 22:29:26.0091 20764:22799 ERROR TDRV:dmem_alloc Failed to alloc DEVICE memory: 16777216
2022-Nov-22 22:29:26.0091 20764:22799 ERROR TDRV:dma_ring_alloc Failed to allocate RX ring
2022-Nov-22 22:29:26.0091 20764:22799 ERROR TDRV:drs_create_data_refill_rings Failed to allocate pring for data refill dma
2022-Nov-22 22:29:26.0091 20764:22799 ERROR TDRV:kbl_model_add create_data_refill_rings() error
2022-Nov-22 22:29:26.0116 20764:20764 ERROR TDRV:remove_model Unknown model: 1001
2022-Nov-22 22:29:26.0116 20764:20764 ERROR TDRV:kbl_model_remove Failed to find and remove model: 1001
2022-Nov-22 22:29:26.0117 20764:20764 ERROR TDRV:remove_model Unknown model: 1001
2022-Nov-22 22:29:26.0117 20764:20764 ERROR TDRV:kbl_model_remove Failed to find and remove model: 1001
2022-Nov-22 22:29:26.0117 20764:20764 ERROR NMGR:dlr_kelf_stage Failed to load subgraph
2022-Nov-22 22:29:26.0354 20764:20764 ERROR NMGR:stage_kelf_models Failed to stage graph: kelf-a.json to NeuronCore
2022-Nov-22 22:29:26.0364 20764:20764 ERROR NMGR:kmgr_load_nn_post_metrics Failed to load NN: 1.11.7.0+aec18907e-/tmp/tmpab7oth00, err: 4
Traceback (most recent call last):
File "infer_test.py", line 34, in <module>
model_neuron = torch.jit.load('model-4c.pt')
File "/root/pytorch_venv/lib64/python3.7/site-packages/torch_neuron/jit_load_wrapper.py", line 13, in wrapper
script_module = jit_load(*args, **kwargs)
File "/root/pytorch_venv/lib64/python3.7/site-packages/torch/jit/_serialization.py", line 162, in load
cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
RuntimeError: Could not load the model status=4 message=Allocation Failure
```
Any help would be appreciated.
I'm using the following code to load a Neuron-compiled model for inference. However, on my inf1.2xlarge instance, neuron-top shows four cores (NC0 to NC3), and only NC0 gets used during inference. How do I increase throughput by using all the cores?
```
from transformers import BertTokenizer, BertModel
import torch
import torch_neuron
import os

os.environ['NEURON_RT_NUM_CORES'] = str(4)
fname = 'modelneuron.pt'
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
inputs = tokenizer("Hello, my dog is cute and big", return_tensors="pt")

if not os.path.isfile(fname):
    model = BertModel.from_pretrained('bert-base-uncased', return_dict=False)
    neuron_model = torch_neuron.trace(
        model, example_inputs=(inputs['input_ids'], inputs['attention_mask']))
    neuron_model.save(fname)
    print('saved neuron model')
else:
    neuron_model = torch.jit.load(fname)
    print('loaded neuron model')

for i in range(10000):
    outputs = neuron_model(inputs['input_ids'], inputs['attention_mask'])
    print(outputs)
```
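For reference, the multi-core pattern I've seen suggested is `torch.neuron.DataParallel` (a sketch, assuming the installed torch-neuron release provides it): it loads the single-core traced model once, replicates it across the visible NeuronCores, and splits each batch along dim 0:
```
import torch
import torch_neuron
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
inputs = tokenizer("Hello, my dog is cute and big", return_tensors="pt")

# Load the single-core traced model once, then let DataParallel
# replicate it onto NC0..NC3.
neuron_model = torch.jit.load('modelneuron.pt')
parallel_model = torch.neuron.DataParallel(neuron_model)

# Repeat the example 4 times so each NeuronCore receives one item.
batch_ids = inputs['input_ids'].repeat(4, 1)
batch_mask = inputs['attention_mask'].repeat(4, 1)
outputs = parallel_model(batch_ids, batch_mask)
```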
Hi,
I want to Neuron-compile a BERT large model (patentbert from Google) which has sequence length 512. How do I do this?
I also want to call the compiled model the same way as before, or else know what I should change when calling it.
I compiled anferico/bert-for-patents using the instructions here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/tensorflow/huggingface_bert/huggingface_bert.html.
Also, is using Google's patentbert the same as using Hugging Face's anferico/bert-for-patents?
The `saved_model_cli` output for each model is shown below.
1. Google's patentbert:
```
MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['input_ids'] tensor_info:
        dtype: DT_INT64
        shape: (-1, 512)
        name: input_ids:0
    inputs['input_mask'] tensor_info:
        dtype: DT_INT64
        shape: (-1, 512)
        name: input_mask:0
    inputs['mlm_positions'] tensor_info:
        dtype: DT_INT64
        shape: (-1, 45)
        name: mlm_positions:0
    inputs['segment_ids'] tensor_info:
        dtype: DT_INT64
        shape: (-1, 512)
        name: segment_ids:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['cls_token'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 1024)
        name: Squeeze:0
    outputs['encoder_layer'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 512, 1024)
        name: bert/encoder/layer_23/output/LayerNorm/batchnorm/add_1:0
    outputs['mlm_logits'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 39859)
        name: cls/predictions/BiasAdd:0
    outputs['next_sentence_logits'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 2)
        name: cls/seq_relationship/BiasAdd:0
  Method name is: tensorflow/serving/predict

MetaGraphDef with tag-set: 'serve, tpu' contains the following SignatureDefs:

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['input_ids'] tensor_info:
        dtype: DT_INT64
        shape: (-1, 512)
        name: input_ids:0
    inputs['input_mask'] tensor_info:
        dtype: DT_INT64
        shape: (-1, 512)
        name: input_mask:0
    inputs['mlm_positions'] tensor_info:
        dtype: DT_INT64
        shape: (-1, 45)
        name: mlm_positions:0
    inputs['segment_ids'] tensor_info:
        dtype: DT_INT64
        shape: (-1, 512)
        name: segment_ids:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['cls_token'] tensor_info:
        dtype: DT_FLOAT
        shape: unknown_rank
        name: TPUPartitionedCall:5
    outputs['encoder_layer'] tensor_info:
        dtype: DT_FLOAT
        shape: unknown_rank
        name: TPUPartitionedCall:6
    outputs['mlm_logits'] tensor_info:
        dtype: DT_FLOAT
        shape: unknown_rank
        name: TPUPartitionedCall:4
    outputs['next_sentence_logits'] tensor_info:
        dtype: DT_FLOAT
        shape: unknown_rank
        name: TPUPartitionedCall:7
  Method name is: tensorflow/serving/predict
```
2. Neuron-compiled anferico/bert-for-patents:
```
MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['__saved_model_init_op']:
  The given SavedModel SignatureDef contains the following input(s):
  The given SavedModel SignatureDef contains the following output(s):
    outputs['__saved_model_init_op'] tensor_info:
        dtype: DT_INVALID
        shape: unknown_rank
        name: NoOp
  Method name is:

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['input_1'] tensor_info:
        dtype: DT_INT32
        shape: (-1, 512)
        name: serving_default_input_1:0
    inputs['input_2'] tensor_info:
        dtype: DT_INT32
        shape: (-1, 512)
        name: serving_default_input_2:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['output_1'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 2)
        name: StatefulPartitionedCall:0
  Method name is: tensorflow/serving/predict

Concrete Functions:
  Function Name: '__call__'
    Option #1
      Callable with:
        Argument #1
          DType: list
          Value: [TensorSpec(shape=(None, 512), dtype=tf.int32, name='input_1'), TensorSpec(shape=(None, 512), dtype=tf.int32, name='input_2')]

  Function Name: '_default_save_signature'
    Option #1
      Callable with:
        Argument #1
          DType: list
          Value: [TensorSpec(shape=(None, 512), dtype=tf.int32, name='input_1'), TensorSpec(shape=(None, 512), dtype=tf.int32, name='input_2')]

  Function Name: 'aws_neuron_function'
    Option #1
      Callable with:
        Argument #1
          args_0
        Argument #2
          args_0_1

  Function Name: 'call_and_return_all_conditional_losses'
    Option #1
      Callable with:
        Argument #1
          DType: list
          Value: [TensorSpec(shape=(None, 512), dtype=tf.int32, name='input_1'), TensorSpec(shape=(None, 512), dtype=tf.int32, name='input_2')]
```
This is the stack trace when I replace the original BERT model with the Neuron-compiled model:
```
Traceback (most recent call last):
File "/home/ubuntu/venv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1645, in _call_with_flat_signature
args.append(kwargs.pop(compat.as_str(keyword)))
KeyError: 'input_1'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/venv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1615, in _call_impl
cancellation_manager)
File "/home/ubuntu/venv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1651, in _call_with_flat_signature
raise TypeError(f"{self._flat_signature_summary()} missing required "
TypeError: signature_wrapper(input_1, input_2) missing required arguments: input_1, input_2.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "main.py", line 319, in <module>
verbose)
File "main.py", line 223, in process_each_project
verbose=verbose)
File "/home/ubuntu/ranking_pipeline/rank_utils.py", line 437, in rank
response, inputs, _ = self.model.predict(search_sentences)
File "/home/ubuntu/ranking_pipeline/bert_utils.py", line 292, in predict
inputs['mlm_ids'], dtype=tf.int64),
File "/home/ubuntu/venv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1601, in __call__
return self._call_impl(args, kwargs)
File "/home/ubuntu/venv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1617, in _call_impl
raise structured_err
File "/home/ubuntu/venv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1611, in _call_impl
cancellation_manager)
File "/home/ubuntu/venv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1688, in _call_with_structured_signature
self._structured_signature_check_missing_args(args, kwargs)
File "/home/ubuntu/venv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1707, in _structured_signature_check_missing_args
raise TypeError(f"{self._structured_signature_summary()} missing "
TypeError: signature_wrapper(*, input_2, input_1) missing required arguments: input_1, input_2.
```
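For reference, a minimal sketch of how I understand the Neuron-compiled SavedModel's serving signature above would be invoked (the path `./bert_neuron_512` and the zero/one dummy inputs are assumptions; note the traced model exposes only two int32 inputs, unlike the original four-input patentbert signature):
```
import numpy as np
import tensorflow as tf

# Load the Neuron-compiled SavedModel (path is an assumption).
reloaded = tf.saved_model.load('./bert_neuron_512')
infer = reloaded.signatures['serving_default']

# input_1/input_2 correspond to input_ids/attention_mask in the order
# they were passed at trace time; both are int32 of shape (batch, 512).
input_ids = tf.constant(np.zeros((1, 512), dtype=np.int32))
attention_mask = tf.constant(np.ones((1, 512), dtype=np.int32))

out = infer(input_1=input_ids, input_2=attention_mask)
print(out['output_1'].shape)  # (1, 2) per the signature dump above
```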
Thanks in advance
Warm regards
Ajay
I am trying to load a Neuron-compiled model generated as described in https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/tensorflow/huggingface_bert/huggingface_bert.html. I am still a newbie, so please excuse my mistakes. This is my code for loading the Neuron-compiled model; it is almost entirely based on the code in the page referred to above.
```
from transformers import pipeline
import tensorflow as tf
import tensorflow.neuron as tfn

class TFBertForSequenceClassificationDictIO(tf.keras.Model):
    def __init__(self, model_wrapped):
        super().__init__()
        self.model_wrapped = model_wrapped
        self.aws_neuron_function = model_wrapped.aws_neuron_function

    def call(self, inputs):
        input_ids = inputs['input_ids']
        attention_mask = inputs['attention_mask']
        logits = self.model_wrapped([input_ids, attention_mask])
        return [logits]

class TFBertForSequenceClassificationFlatIO(tf.keras.Model):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def call(self, inputs):
        input_ids, attention_mask = inputs
        output = self.model({'input_ids': input_ids, 'attention_mask': attention_mask})
        return output['logits']

string_inputs = [
    'I love to eat pizza!',
    'I am sorry. I really want to like it, but I just can not stand sushi.',
    'I really do not want to type out 128 strings to create batch 128 data.',
    'Ah! Multiplying this list by 32 would be a great solution!',
]
string_inputs = string_inputs * 32

model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
neuron_pipe = pipeline('sentiment-analysis', model=model_name, framework='tf')
example_inputs = neuron_pipe.tokenizer(string_inputs)
pipe = pipeline('sentiment-analysis', model=model_name, framework='tf')

reloaded_model = tf.keras.models.load_model('./distilbert_b128_2')
model_wrapped = TFBertForSequenceClassificationFlatIO(pipe.model)
example_inputs_list = [example_inputs['input_ids'], example_inputs['attention_mask']]
model_wrapped_traced = tfn.trace(model_wrapped, example_inputs_list)
rewrapped_model = TFBertForSequenceClassificationDictIO(model_wrapped_traced)
```
This is the stack trace:
```
2022-11-05 02:46:55.553817: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.
All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.
Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
Traceback (most recent call last):
File "inferencesmall.py", line 40, in <module>
model_wrapped_traced = tfn.trace(model_wrapped, example_inputs_list)
File "/home/ubuntu/huggingface/venv/lib/python3.7/site-packages/tensorflow_neuron/python/_trace.py", line 167, in trace
func = func.get_concrete_function(*example_inputs)
File "/home/ubuntu/huggingface/venv/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 1264, in get_concrete_function
concrete = self._get_concrete_function_garbage_collected(*args, **kwargs)
File "/home/ubuntu/huggingface/venv/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 1244, in _get_concrete_function_garbage_collected
self._initialize(args, kwargs, add_initializers_to=initializers)
File "/home/ubuntu/huggingface/venv/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 786, in _initialize
*args, **kwds))
File "/home/ubuntu/huggingface/venv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2983, in _get_concrete_function_internal_garbage_collected
graph_function, _ = self._maybe_define_function(args, kwargs)
File "/home/ubuntu/huggingface/venv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3292, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/home/ubuntu/huggingface/venv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3140, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/home/ubuntu/huggingface/venv/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 1161, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/home/ubuntu/huggingface/venv/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 677, in wrapped_fn
out = weak_wrapped_fn().__wrapped__(*args, **kwds)
File "/home/ubuntu/huggingface/venv/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 1147, in autograph_handler
raise e.ag_error_metadata.to_exception(e)
tensorflow.python.autograph.impl.api.StagingError: in user code:
File "/home/ubuntu/huggingface/venv/lib/python3.7/site-packages/keras/engine/base_layer.py", line 987, in error_handler *
return fn(*args, **kwargs)
StagingError: Exception encountered when calling layer "tf_bert_for_sequence_classification_flat_io" (type TFBertForSequenceClassificationFlatIO).
in user code:
File "inferencesmall.py", line 22, in call *
output = self.model({'input_ids': input_ids, 'attention_mask': attention_mask})
File "/home/ubuntu/huggingface/venv/lib/python3.7/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler **
raise e.with_traceback(filtered_tb) from None
StagingError: Exception encountered when calling layer "tf_distil_bert_for_sequence_classification_1" (type TFDistilBertForSequenceClassification).
in user code:
File "/home/ubuntu/huggingface/venv/lib/python3.7/site-packages/transformers/models/distilbert/modeling_tf_distilbert.py", line 798, in call *
distilbert_output = self.distilbert(
File "/home/ubuntu/huggingface/venv/lib/python3.7/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler **
raise e.with_traceback(filtered_tb) from None
StagingError: Exception encountered when calling layer "distilbert" (type TFDistilBertMainLayer).
in user code:
File "/home/ubuntu/huggingface/venv/lib/python3.7/site-packages/transformers/models/distilbert/modeling_tf_distilbert.py", line 423, in call *
inputs = input_processing(
File "/home/ubuntu/huggingface/venv/lib/python3.7/site-packages/transformers/modeling_tf_utils.py", line 372, in input_processing *
output[parameter_names[i]] = input
IndexError: list index out of range
```
Thanks in advance
Ajay
Hi,
This link https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/tensorflow/tensorflow-neuron/tutorials/bert_demo/bert_demo.html describes how to compile using TensorFlow 1. Can anyone let me know the steps to Neuron-compile a BERT large model for running inference on Inferentia using TensorFlow v2?
Thanks in advance,
Ajay
P.S. This is what my log looks like when compiling on TF1:
```
INFO:tensorflow:fusing subgraph {subgraph neuron_op_e76ab3d9bc74f09f with input tensors ["<tf.Tensor 'bert/encoder/ones0/_0:0' shape=(1, 512, 1) dtype=float32>", "<tf.Tensor 'bert/encoder/Cast0/_1:0' shape=(1, 1, 512) dtype=float32>", "<tf.Tensor 'bert/embeddings/LayerNorm/batchnorm/add_10/_2:0' shape=(1, 512, 1024) dtype=float32>"], output tensors ["<tf.Tensor 'bert/pooler/dense/Tanh:0' shape=(1, 1024) dtype=float32>", "<tf.Tensor 'bert/encoder/layer_23/output/LayerNorm/batchnorm/add_1:0' shape=(1, 512, 1024) dtype=float32>"]} with neuron-cc
Compiler status ERROR
WARNING:tensorflow:11/03/2022 04:28:48 AM ERROR 9932 [neuron-cc]: Failed to parse model /tmp/tmpbyvnmr6h/neuron_op_e76ab3d9bc74f09f/graph_def.pb: The following operators are not implemented: {'Einsum'} (NotImplementedError)
INFO:tensorflow:Number of operations in TensorFlow session: 7427
INFO:tensorflow:Number of operations after tf.neuron optimizations: 2901
INFO:tensorflow:Number of operations placed on Neuron runtime: 0
WARNING:tensorflow:Converted /home/ubuntu/bert_repo/patent_model/ to ./bert-saved-model-neuron_tf1.15 but no operator will be running on AWS machine learning accelerators. This is probably not what you want. Please refer to https://github.com/aws/aws-neuron-sdk for current limitations of the AWS Neuron SDK. We are actively improving (and hiring)! {'OnNeuronRatio': 0.0}
```
I assume the OnNeuronRatio being 0 means that I won't be able to make use of Inferentia hardware acceleration. Is that correct?
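For what it's worth, the TF2 flow I've seen uses `tensorflow.neuron.trace` with the same wrapper pattern as the huggingface_bert tutorial linked above. A minimal sketch (the checkpoint name, output choice, and save path are assumptions, and whether every op such as Einsum compiles depends on the neuron-cc version):
```
import tensorflow as tf
import tensorflow.neuron as tfn
from transformers import BertTokenizer, TFBertModel

# Wrapper that flattens dict inputs into a list, as in the tutorial.
class FlatIO(tf.keras.Model):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def call(self, inputs):
        input_ids, attention_mask = inputs
        output = self.model({'input_ids': input_ids, 'attention_mask': attention_mask})
        return output['last_hidden_state']

name = 'anferico/bert-for-patents'  # stand-in for the Google checkpoint
tokenizer = BertTokenizer.from_pretrained(name)
model = TFBertModel.from_pretrained(name)

inputs = tokenizer('example patent text', padding='max_length',
                   max_length=512, truncation=True, return_tensors='tf')

wrapped = FlatIO(model)
traced = tfn.trace(wrapped, [inputs['input_ids'], inputs['attention_mask']])
traced.save('./bert_large_512_neuron')  # assumed output path
```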
I followed the user guide on updating torch-neuron and then started compiling the model to Neuron,
but got an error from which I can't tell what's wrong.
The Neuron SDK claims that it should compile all operations, even unsupported ones; those should simply run on the CPU instead.
The error:
```
INFO:Neuron:All operators are compiled by neuron-cc (this does not guarantee that neuron-cc will successfully compile)
INFO:Neuron:Number of arithmetic operators (pre-compilation) before = 3345, fused = 3345, percent fused = 100.0%
INFO:Neuron:Number of neuron graph operations 8175 did not match traced graph 9652 - using heuristic matching of hierarchical information
INFO:Neuron:Compiling function _NeuronGraph$3362 with neuron-cc
INFO:Neuron:Compiling with command line: '/home/ubuntu/alias/neuron/neuron_env/bin/neuron-cc compile /tmp/tmpmp8qvhtb/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpmp8qvhtb/graph_def.neff --io-config {"inputs": {"0:0": [[1, 3, 768, 768], "float32"]}, "outputs": ["aten_sigmoid/Sigmoid:0"]} --verbose 35'
..............................................................................INFO:Neuron:Compile command returned: -9
WARNING:Neuron:torch.neuron.trace failed on _NeuronGraph$3362; falling back to native python function call
ERROR:Neuron:neuron-cc failed with the following command line call:
/home/ubuntu/alias/neuron/neuron_env/bin/neuron-cc compile /tmp/tmpmp8qvhtb/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpmp8qvhtb/graph_def.neff --io-config '{"inputs": {"0:0": [[1, 3, 768, 768], "float32"]}, "outputs": ["aten_sigmoid/Sigmoid:0"]}' --verbose 35
Traceback (most recent call last):
File "/home/ubuntu/alias/neuron/neuron_env/lib/python3.7/site-packages/torch_neuron/convert.py", line 382, in op_converter
item, inputs, compiler_workdir=sg_workdir, **kwargs)
File "/home/ubuntu/alias/neuron/neuron_env/lib/python3.7/site-packages/torch_neuron/decorators.py", line 220, in trace
'neuron-cc failed with the following command line call:\n{}'.format(command))
subprocess.SubprocessError: neuron-cc failed with the following command line call:
/home/ubuntu/alias/neuron/neuron_env/bin/neuron-cc compile /tmp/tmpmp8qvhtb/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpmp8qvhtb/graph_def.neff --io-config '{"inputs": {"0:0": [[1, 3, 768, 768], "float32"]}, "outputs": ["aten_sigmoid/Sigmoid:0"]}' --verbose 35
INFO:Neuron:Number of arithmetic operators (post-compilation) before = 3345, compiled = 0, percent compiled = 0.0%
INFO:Neuron:The neuron partitioner created 1 sub-graphs
INFO:Neuron:Neuron successfully compiled 0 sub-graphs, Total fused subgraphs = 1, Percent of model sub-graphs successfully compiled = 0.0%
INFO:Neuron:Compiled these operators (and operator counts) to Neuron:
INFO:Neuron:Not compiled operators (and operator counts) to Neuron:
INFO:Neuron: => aten::Int: 942 [supported]
INFO:Neuron: => aten::_convolution: 107 [supported]
INFO:Neuron: => aten::add: 104 [supported]
INFO:Neuron: => aten::batch_norm: 1 [supported]
INFO:Neuron: => aten::cat: 1 [supported]
INFO:Neuron: => aten::contiguous: 4 [supported]
INFO:Neuron: => aten::div: 104 [supported]
INFO:Neuron: => aten::dropout: 208 [supported]
INFO:Neuron: => aten::feature_dropout: 1 [supported]
INFO:Neuron: => aten::flatten: 60 [supported]
INFO:Neuron: => aten::gelu: 52 [supported]
INFO:Neuron: => aten::layer_norm: 161 [supported]
INFO:Neuron: => aten::linear: 264 [supported]
INFO:Neuron: => aten::matmul: 104 [supported]
INFO:Neuron: => aten::mul: 52 [supported]
INFO:Neuron: => aten::permute: 210 [supported]
INFO:Neuron: => aten::relu: 1 [supported]
INFO:Neuron: => aten::reshape: 262 [supported]
INFO:Neuron: => aten::select: 104 [supported]
INFO:Neuron: => aten::sigmoid: 1 [supported]
INFO:Neuron: => aten::size: 278 [supported]
INFO:Neuron: => aten::softmax: 52 [supported]
INFO:Neuron: => aten::transpose: 216 [supported]
INFO:Neuron: => aten::upsample_bilinear2d: 4 [supported]
INFO:Neuron: => aten::view: 52 [supported]
Traceback (most recent call last):
File "to_neuron.py", line 14, in <module>
model_neuron = torch.neuron.trace(model, example_inputs=[image.cuda()])
File "/home/ubuntu/alias/neuron/neuron_env/lib/python3.7/site-packages/torch_neuron/convert.py", line 184, in trace
cu.stats_post_compiler(neuron_graph)
File "/home/ubuntu/alias/neuron/neuron_env/lib/python3.7/site-packages/torch_neuron/convert.py", line 493, in stats_post_compiler
"No operations were successfully partitioned and compiled to neuron for this model - aborting trace!")
RuntimeError: No operations were successfully partitioned and compiled to neuron for this model - aborting trace!
```
I'm following some guides, and from my understanding this should be possible, but I've been trying for hours to compile a YOLOv5 model into a Neuron model with no success. Is it even possible to do this on my local machine, or do I have to be on an Inferentia instance? (See the sketch after the package list below for the kind of trace call I mean.)
This is what my environment looks like:
```
# packages in environment at /miniconda3/envs/neuron:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
absl-py 1.2.0 pypi_0 pypi
astor 0.8.1 pypi_0 pypi
attrs 22.1.0 pypi_0 pypi
backcall 0.2.0 pyhd3eb1b0_0
ca-certificates 2022.07.19 h06a4308_0
cachetools 5.2.0 pypi_0 pypi
certifi 2022.9.24 pypi_0 pypi
charset-normalizer 2.1.1 pypi_0 pypi
cycler 0.11.0 pypi_0 pypi
debugpy 1.5.1 py37h295c915_0
decorator 5.1.1 pyhd3eb1b0_0
dmlc-nnvm 1.11.1.0+0 pypi_0 pypi
dmlc-topi 1.11.1.0+0 pypi_0 pypi
dmlc-tvm 1.11.1.0+0 pypi_0 pypi
entrypoints 0.4 py37h06a4308_0
fonttools 4.37.3 pypi_0 pypi
gast 0.2.2 pypi_0 pypi
google-auth 2.12.0 pypi_0 pypi
google-auth-oauthlib 0.4.6 pypi_0 pypi
google-pasta 0.2.0 pypi_0 pypi
gputil 1.4.0 pypi_0 pypi
grpcio 1.49.1 pypi_0 pypi
h5py 3.7.0 pypi_0 pypi
idna 3.4 pypi_0 pypi
importlib-metadata 4.12.0 pypi_0 pypi
inferentia-hwm 1.11.0.0+0 pypi_0 pypi
iniconfig 1.1.1 pypi_0 pypi
ipykernel 6.15.2 py37h06a4308_0
ipython 7.34.0 pypi_0 pypi
ipywidgets 8.0.2 pypi_0 pypi
islpy 2021.1+aws2021.x.16.0.bld0 pypi_0 pypi
jedi 0.18.1 py37h06a4308_1
jupyter_client 7.3.5 py37h06a4308_0
jupyter_core 4.10.0 py37h06a4308_0
jupyterlab-widgets 3.0.3 pypi_0 pypi
keras-applications 1.0.8 pypi_0 pypi
keras-preprocessing 1.1.2 pypi_0 pypi
kiwisolver 1.4.4 pypi_0 pypi
ld_impl_linux-64 2.38 h1181459_1
libffi 3.3 he6710b0_2
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libsodium 1.0.18 h7b6447c_0
libstdcxx-ng 11.2.0 h1234567_1
llvmlite 0.39.1 pypi_0 pypi
markdown 3.4.1 pypi_0 pypi
markupsafe 2.1.1 pypi_0 pypi
matplotlib 3.5.3 pypi_0 pypi
matplotlib-inline 0.1.6 py37h06a4308_0
ncurses 6.3 h5eee18b_3
nest-asyncio 1.5.5 py37h06a4308_0
networkx 2.4 pypi_0 pypi
neuron-cc 1.11.7.0+aec18907e pypi_0 pypi
numba 0.56.2 pypi_0 pypi
numpy 1.19.5 pypi_0 pypi
oauthlib 3.2.1 pypi_0 pypi
opencv-python 4.6.0.66 pypi_0 pypi
openssl 1.1.1q h7f8727e_0
opt-einsum 3.3.0 pypi_0 pypi
packaging 21.3 pyhd3eb1b0_0
pandas 1.3.5 pypi_0 pypi
parso 0.8.3 pyhd3eb1b0_0
pexpect 4.8.0 pyhd3eb1b0_3
pickleshare 0.7.5 pyhd3eb1b0_1003
pillow 9.2.0 pypi_0 pypi
pip 22.2.2 pypi_0 pypi
pluggy 1.0.0 pypi_0 pypi
prompt-toolkit 3.0.31 pypi_0 pypi
protobuf 3.20.3 pypi_0 pypi
psutil 5.9.2 pypi_0 pypi
ptyprocess 0.7.0 pyhd3eb1b0_2
py 1.11.0 pypi_0 pypi
pyasn1 0.4.8 pypi_0 pypi
pyasn1-modules 0.2.8 pypi_0 pypi
pygments 2.13.0 pypi_0 pypi
pyparsing 3.0.9 py37h06a4308_0
pytest 7.1.3 pypi_0 pypi
python 3.7.13 h12debd9_0
python-dateutil 2.8.2 pyhd3eb1b0_0
pytz 2022.2.1 pypi_0 pypi
pyyaml 6.0 pypi_0 pypi
pyzmq 23.2.0 py37h6a678d5_0
readline 8.1.2 h7f8727e_1
requests 2.28.1 pypi_0 pypi
requests-oauthlib 1.3.1 pypi_0 pypi
rsa 4.9 pypi_0 pypi
scipy 1.4.1 pypi_0 pypi
seaborn 0.12.0 pypi_0 pypi
setuptools 59.8.0 pypi_0 pypi
six 1.16.0 pyhd3eb1b0_1
sqlite 3.39.3 h5082296_0
tensorboard 1.15.0 pypi_0 pypi
tensorboard-data-server 0.6.1 pypi_0 pypi
tensorboard-plugin-wit 1.8.1 pypi_0 pypi
tensorflow 1.15.0 pypi_0 pypi
tensorflow-estimator 1.15.1 pypi_0 pypi
termcolor 2.0.1 pypi_0 pypi
thop 0.1.1-2209072238 pypi_0 pypi
tk 8.6.12 h1ccaba5_0
tomli 2.0.1 pypi_0 pypi
torch 1.11.0 pypi_0 pypi
torch-neuron 1.11.0.2.3.0.0 pypi_0 pypi
torchvision 0.12.0 pypi_0 pypi
tornado 6.2 py37h5eee18b_0
tqdm 4.64.1 pypi_0 pypi
traitlets 5.4.0 pypi_0 pypi
typing-extensions 4.3.0 pypi_0 pypi
urllib3 1.26.12 pypi_0 pypi
wcwidth 0.2.5 pyhd3eb1b0_0
werkzeug 2.2.2 pypi_0 pypi
wheel 0.37.1 pypi_0 pypi
widgetsnbextension 4.0.3 pypi_0 pypi
wrapt 1.14.1 pypi_0 pypi
xz 5.2.6 h5eee18b_0
zeromq 4.3.4 h2531618_0
zipp 3.8.1 pypi_0 pypi
zlib 1.2.12 h5eee18b_3
```
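For context, this is roughly the trace call in question (a sketch: the torch.hub checkpoint, input shape, and output path are assumptions; compilation itself runs on the host CPU, so the example input should be a CPU tensor, not `.cuda()`):
```
import torch
import torch_neuron

# Load a YOLOv5 checkpoint via torch.hub (assumed; any YOLOv5 model works).
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
model.eval()

# Example input stays on the CPU: torch_neuron traces on the host,
# so no GPU and no Inferentia device are needed at compile time.
image = torch.zeros(1, 3, 640, 640, dtype=torch.float32)

model_neuron = torch.neuron.trace(model, example_inputs=[image])
model_neuron.save('yolov5s_neuron.pt')
```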
Hi Team,
I wanted to compile a BERT model and run it on Inferentia. I trained my model using PyTorch and tried to convert it by following the steps in this [tutorial](https://github.com/aws-neuron/aws-neuron-sdk/blob/master/src/examples/pytorch/bert_tutorial/tutorial_pretrained_bert.ipynb) on my Amazon Linux machine, but I keep getting a failure with this error:
```
09/22/2022 06:13:56 PM ERROR 23737 [neuron-cc]: Failed to parse model /tmp/tmp64l9ygmj/graph_def.pb: The following operators are not implemented: {'SelectV2'} (NotImplementedError)
```
I followed the installation steps [here](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-intro/pytorch-setup/pytorch-install.html) for pytorch-1.11.0 and tried to execute the code in the tutorial, but got the same error.
We wanted to explore using Inferentia for our large BERT model but are blocked by this failure to convert to the NEFF format. I also tried following the steps using TF and ran into a different unsupported-ops issue. Could you please help?
Below are the setup commands I ran on my Amazon Linux desktop:
```
sudo yum install -y python3.7-venv gcc-c++
python3.7 -m venv pytorch_venv
source pytorch_venv/bin/activate
pip install -U pip
# Set Pip repository to point to the Neuron repository
pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
#Install Neuron PyTorch
pip install torch-neuron neuron-cc[tensorflow] "protobuf<4" torchvision
pip install --upgrade "transformers==4.6.0"
pip install tensorflow==2.8.1
```
and then executed the script below (copied from the tutorial) on my Amazon Linux host:
```
import tensorflow # to workaround a protobuf version conflict issue
import torch
import torch.neuron
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
import transformers
import os
import warnings
# Setting up NeuronCore groups for inf1.6xlarge with 16 cores
num_cores = 16 # This value should be 4 on inf1.xlarge and inf1.2xlarge
nc_env = ','.join(['1'] * num_cores)
warnings.warn("NEURONCORE_GROUP_SIZES is being deprecated, if your application is using NEURONCORE_GROUP_SIZES please \
see https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/deprecation.html#announcing-end-of-support-for-neuroncore-group-sizes \
for more details.", DeprecationWarning)
os.environ['NEURONCORE_GROUP_SIZES'] = nc_env
# Build tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc", return_dict=False)
# Setup some example inputs
sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
max_length=128
paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt")
not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt")
# Run the original PyTorch model on the compilation example
paraphrase_classification_logits = model(**paraphrase)[0]
# Convert example inputs to a format that is compatible with TorchScript tracing
example_inputs_paraphrase = paraphrase['input_ids'], paraphrase['attention_mask'], paraphrase['token_type_ids']
example_inputs_not_paraphrase = not_paraphrase['input_ids'], not_paraphrase['attention_mask'], not_paraphrase['token_type_ids']
# Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron
model_neuron = torch.neuron.trace(model, example_inputs_paraphrase)
```
This gave me the following error:
```
2022-09-22 18:13:12.145617: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-09-22 18:13:12.145649: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
sample_pytorch_model.py:14: DeprecationWarning: NEURONCORE_GROUP_SIZES is being deprecated, if your application is using NEURONCORE_GROUP_SIZES please see https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/deprecation.html#announcing-end-of-support-for-neuroncore-group-sizes for more details.
for more details.", DeprecationWarning)
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 433/433 [00:00<00:00, 641kB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████| 213k/213k [00:00<00:00, 636kB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████| 436k/436k [00:00<00:00, 731kB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████| 29.0/29.0 [00:00<00:00, 35.2kB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████| 433M/433M [00:09<00:00, 45.7MB/s]
/local/home/spareek/pytorch_venv/lib64/python3.7/site-packages/transformers/modeling_utils.py:1968: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
input_tensor.shape[chunk_dim] == tensor_shape for input_tensor in input_tensors
INFO:Neuron:There are 3 ops of 1 different types in the TorchScript that are not compiled by neuron-cc: aten::embedding, (For more information see https://github.com/aws/aws-neuron-sdk/blob/master/release-notes/neuron-cc-ops/neuron-cc-ops-pytorch.md)
INFO:Neuron:Number of arithmetic operators (pre-compilation) before = 565, fused = 548, percent fused = 96.99%
INFO:Neuron:Number of neuron graph operations 1601 did not match traced graph 1323 - using heuristic matching of hierarchical information
INFO:Neuron:Compiling function _NeuronGraph$662 with neuron-cc
INFO:Neuron:Compiling with command line: '/local/home/spareek/pytorch_venv/bin/neuron-cc compile /tmp/tmp64l9ygmj/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmp64l9ygmj/graph_def.neff --io-config {"inputs": {"0:0": [[1, 128, 768], "float32"], "1:0": [[1, 1, 1, 128], "float32"]}, "outputs": ["Linear_5/aten_linear/Add:0"]} --verbose 35'
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
.2022-09-22 18:13:52.697717: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-09-22 18:13:52.697749: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
09/22/2022 06:13:56 PM ERROR 23737 [neuron-cc]: Failed to parse model /tmp/tmp64l9ygmj/graph_def.pb: The following operators are not implemented: {'SelectV2'} (NotImplementedError)
Compiler status ERROR
INFO:Neuron:Compile command returned: 1
WARNING:Neuron:torch.neuron.trace failed on _NeuronGraph$662; falling back to native python function call
ERROR:Neuron:neuron-cc failed with the following command line call:
/local/home/spareek/pytorch_venv/bin/neuron-cc compile /tmp/tmp64l9ygmj/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmp64l9ygmj/graph_def.neff --io-config '{"inputs": {"0:0": [[1, 128, 768], "float32"], "1:0": [[1, 1, 1, 128], "float32"]}, "outputs": ["Linear_5/aten_linear/Add:0"]}' --verbose 35
Traceback (most recent call last):
File "/local/home/spareek/pytorch_venv/lib64/python3.7/site-packages/torch_neuron/convert.py", line 382, in op_converter
item, inputs, compiler_workdir=sg_workdir, **kwargs)
File "/local/home/spareek/pytorch_venv/lib64/python3.7/site-packages/torch_neuron/decorators.py", line 220, in trace
'neuron-cc failed with the following command line call:\n{}'.format(command))
subprocess.SubprocessError: neuron-cc failed with the following command line call:
/local/home/spareek/pytorch_venv/bin/neuron-cc compile /tmp/tmp64l9ygmj/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmp64l9ygmj/graph_def.neff --io-config '{"inputs": {"0:0": [[1, 128, 768], "float32"], "1:0": [[1, 1, 1, 128], "float32"]}, "outputs": ["Linear_5/aten_linear/Add:0"]}' --verbose 35
INFO:Neuron:Number of arithmetic operators (post-compilation) before = 565, compiled = 0, percent compiled = 0.0%
INFO:Neuron:The neuron partitioner created 1 sub-graphs
INFO:Neuron:Neuron successfully compiled 0 sub-graphs, Total fused subgraphs = 1, Percent of model sub-graphs successfully compiled = 0.0%
INFO:Neuron:Compiled these operators (and operator counts) to Neuron:
INFO:Neuron:Not compiled operators (and operator counts) to Neuron:
INFO:Neuron: => aten::Int: 97 [supported]
INFO:Neuron: => aten::add: 39 [supported]
INFO:Neuron: => aten::contiguous: 12 [supported]
INFO:Neuron: => aten::div: 12 [supported]
INFO:Neuron: => aten::dropout: 38 [supported]
INFO:Neuron: => aten::embedding: 3 [not supported]
INFO:Neuron: => aten::gelu: 12 [supported]
INFO:Neuron: => aten::layer_norm: 25 [supported]
INFO:Neuron: => aten::linear: 74 [supported]
INFO:Neuron: => aten::matmul: 24 [supported]
INFO:Neuron: => aten::mul: 1 [supported]
INFO:Neuron: => aten::permute: 48 [supported]
INFO:Neuron: => aten::rsub: 1 [supported]
INFO:Neuron: => aten::select: 1 [supported]
INFO:Neuron: => aten::size: 97 [supported]
INFO:Neuron: => aten::slice: 5 [supported]
INFO:Neuron: => aten::softmax: 12 [supported]
INFO:Neuron: => aten::tanh: 1 [supported]
INFO:Neuron: => aten::to: 1 [supported]
INFO:Neuron: => aten::transpose: 12 [supported]
INFO:Neuron: => aten::unsqueeze: 2 [supported]
INFO:Neuron: => aten::view: 48 [supported]
Traceback (most recent call last):
File "sample_pytorch_model.py", line 38, in <module>
model_neuron = torch.neuron.trace(model, example_inputs_paraphrase)
File "/local/home/spareek/pytorch_venv/lib64/python3.7/site-packages/torch_neuron/convert.py", line 184, in trace
cu.stats_post_compiler(neuron_graph)
File "/local/home/spareek/pytorch_venv/lib64/python3.7/site-packages/torch_neuron/convert.py", line 493, in stats_post_compiler
"No operations were successfully partitioned and compiled to neuron for this model - aborting trace!")
RuntimeError: No operations were successfully partitioned and compiled to neuron for this model - aborting trace!
```
I am trying to test a model compiled for Inferentia on an `inf1.2xlarge`, but when loading the model I receive the following error messages:
```
2022-Sep-15 22:10:01.0152 3802:3802 ERROR TDRV:dmem_alloc Failed to alloc DEVICE memory: 1073741824
2022-Sep-15 22:10:01.0152 3802:3802 ERROR TDRV:dma_ring_alloc Failed to allocate TX ring
2022-Sep-15 22:10:01.0172 3802:3802 ERROR TDRV:io_create_rings Failed to allocate io ring for queue qPoolOut0_0
2022-Sep-15 22:10:01.0172 3802:3802 ERROR TDRV:kbl_model_add create_io_rings() error
2022-Sep-15 22:10:01.0182 3802:3802 ERROR NMGR:dlr_kelf_stage Failed to load subgraph
2022-Sep-15 22:10:01.0182 3802:3802 ERROR NMGR:stage_kelf_models Failed to stage graph: kelf-a.json to NeuronCore
2022-Sep-15 22:10:01.0184 3802:3802 ERROR NMGR:kmgr_load_nn_post_metrics Failed to load NN: 1.11.7.0+aec18907e-/tmp/tmptkxl8amn, err: 4
```
These are wrapped into a Python runtime exception:
```
RuntimeError: Could not load the model status=4 message=Allocation Failure
```
I presume that this is because the model is on the large side. The `.neff` file is 373 MB and takes ~4 hours to compile for a batch size of 1.
This particular model is compiled for a single NeuronCore. I am now trying to compile with `--neuroncore-pipeline-cores 4` to spread the model across multiple cores. This, however, gives me the following log message:
```
INFO: The requested number of neuroncore-pipeline-cores (4) may not be suitable for this network, and may lead to sub-optimal performance. Recommended neuroncore-pipeline-cores for this network is 1.
```
(I can't find any technical details on how much memory an Inferentia chip has, although I'm guessing that, due to the Inferentia architecture, "memory" is not used in the same way as it might be on a CPU or GPU.)
So, what is a practical size limit for an Inferentia model, and what can I do about running this model on Inf1?
I have compiled my model to run on Inferentia, and I can load multiple models from one process, such as a single Jupyter notebook.
I am trying to host the models via a server, using gunicorn as the interface. When I tell gunicorn to use more than 1 worker, the process crashes and I receive an error like this:
```
2022-Aug-16 00:51:15.0842 22127:22127 ERROR NRT:nrt_allocate_neuron_cores NeuronCore(s) not available - Requested:16 Available:0
```
Gunicorn runs one parent process, and the worker count determines the number of child processes, so in this case there are multiple child processes that each want to use one core.
I would like to know if there is any way to have all of the cores utilized across multiple child processes. If there is any documentation around this, or a potential solution that may work, that would be greatly appreciated.
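One pattern that may work (a sketch only: the worker count, core count, and the use of `NEURON_RT_VISIBLE_CORES` for pinning are assumptions based on the Neuron runtime's environment variables) is to give each gunicorn child its own NeuronCore from a `post_fork` hook, before the worker loads the model:
```
# gunicorn_conf.py -- a sketch: pin each worker to one NeuronCore.
import os

workers = 4       # e.g. one worker per NeuronCore on inf1.2xlarge
NUM_CORES = 4     # NeuronCores available on the instance

def post_fork(server, worker):
    # worker.age is a 1-based counter of workers spawned by this master.
    core_id = (worker.age - 1) % NUM_CORES
    # Must be set before the Neuron runtime initializes in this child,
    # i.e. before the app imports torch_neuron and loads the model.
    os.environ['NEURON_RT_VISIBLE_CORES'] = str(core_id)
```
Launched with something like `gunicorn -c gunicorn_conf.py app:server`, each child should then claim only the single core it was assigned instead of requesting every core.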