Questions tagged with AWS Deep Learning Containers
Hi, with the release of EKS 1.23 and 1.24 we are moving ever closer to the full removal of PodSecurityPolicy. To replace it, the PodSecurity admission controller has graduated to beta and is now enabled by default. To enforce security we now need to add labels to namespaces, but Kubernetes should also give full control over the admission controller's default settings, at least according to [this article](https://kubernetes.io/docs/tutorials/security/cluster-level-pss/).
My question is: how do I edit the admission controller configuration to make the default settings more or less restrictive? I am not a very advanced Kubernetes user yet, but as I understand it, this configuration is supposed to be passed to the Kubernetes API server. If that's true, how do I edit it, given that EKS hides the API server from me?
Thanks
Hello,
I've been trying to deploy multiple PyTorch models on one SageMaker endpoint from a SageMaker notebook. First I tested deploying single models to single endpoints to check that everything works smoothly, and it did. I would create a PyTorchModel first:
```
import sagemaker
from sagemaker.pytorch import PyTorchModel
from sagemaker import get_execution_role
from sagemaker.multidatamodel import MultiDataModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
import boto3

role = get_execution_role()
sagemaker_session = sagemaker.Session()

pytorch_model = PyTorchModel(
    entry_point='inference.py',
    source_dir='code',
    role=role,
    model_data='s3://***/model/model.tar.gz',
    framework_version='1.11.0',
    py_version='py38',
    name='***-model',
    sagemaker_session=sagemaker_session
)
```
MultiDataModel inherits properties from the Model class, so I used the same PyTorch model that I used for single-model deployment.
Then I would define the MultiDataModel as follows:
```
models = MultiDataModel(
    name='***-multi-model',
    model_data_prefix='s3://***-sagemaker/model/',
    model=pytorch_model,
    sagemaker_session=sagemaker_session
)
```
All it should need is the S3 prefix under which the model artifacts are saved as tar.gz files (the same files used for single-model deployment), the previously defined PyTorch model, a name, and a sagemaker_session.
To deploy it:
```
models.deploy(
    initial_instance_count=1,
    instance_type='ml.m4.xlarge',
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
    endpoint_name='***-multi-model-deployment',
)
```
The deployment goes well, as there are no failures and the endpoint is InService by the end of this step.
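As a sanity check of which artifacts the endpoint can actually see, MultiDataModel exposes list_models(), which enumerates the tar.gz files under model_data_prefix. A quick sketch, assuming the `models` object defined above is still in scope:
```
# Each key returned here is relative to model_data_prefix and is what would be
# passed as TargetModel when invoking the endpoint.
for model_path in models.list_models():
    print(model_path)
```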
However, the error occurs when I try to run inference on one of the models:
```
import json

body = {"url": "https://***image.jpg"}  # URL to an image online
payload = json.dumps(body)

client = boto3.client('sagemaker-runtime')
response = client.invoke_endpoint(
    EndpointName="***-multi-model-deployment",
    ContentType="application/json",
    TargetModel="/model.tar.gz",
    Body=payload
)
```
This prompts an error message:
```
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message "{
"code": 500,
"type": "InternalServerException",
"message": "Failed to start workers for model ec1cd509c40ca81ffc3fb09deb4599e2 version: 1.0"
}
". See https://***.console.aws.amazon.com/cloudwatch/home?region=***#logEventViewer:group=/aws/sagemaker/Endpoints/***-multi-model-deployment in account ***** for more information.
```
The CloudWatch logs show this error in particular:
```
2022-09-26T15:51:40,494 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py", line 210, in <module>
2022-09-26T15:51:40,494 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - worker.run_server()
2022-09-26T15:51:40,494 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py", line 181, in run_server
2022-09-26T15:51:40,495 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self.handle_connection(cl_socket)
2022-09-26T15:51:40,495 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py", line 139, in handle_connection
2022-09-26T15:51:40,495 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - service, result, code = self.load_model(msg)
2022-09-26T15:51:40,495 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py", line 104, in load_model
2022-09-26T15:51:40,495 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - service = model_loader.load(
2022-09-26T15:51:40,495 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_loader.py", line 151, in load
2022-09-26T15:51:40,495 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - initialize_fn(service.context)
2022-09-26T15:51:40,495 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/sagemaker_pytorch_serving_container/handler_service.py", line 51, in initialize
2022-09-26T15:51:40,495 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - super().initialize(context)
2022-09-26T15:51:40,495 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/sagemaker_inference/default_handler_service.py", line 66, in initialize
2022-09-26T15:51:40,495 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self._service.validate_and_initialize(model_dir=model_dir)
2022-09-26T15:51:40,495 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/sagemaker_inference/transformer.py", line 162, in validate_and_initialize
2022-09-26T15:51:40,495 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self._model = self._model_fn(model_dir)
2022-09-26T15:51:40,495 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/sagemaker_pytorch_serving_container/default_pytorch_inference_handler.py", line 73, in default_model_fn
2022-09-26T15:51:40,495 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - raise ValueError(
2022-09-26T15:51:40,496 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - ValueError: Exactly one .pth or .pt file is required for PyTorch models: []
```
It seems like it's having problems loading the model, saying exactly one .pth file is required, even though in the invocation I point to the exact model artifact present at that S3 prefix. I'm having a hard time fixing this issue, so it would be very helpful if anyone had some suggestions!
Instead of giving the MultiDataModel a model, I also tried providing it an ECR Docker image with the same inference code, but I got the same error when invoking the endpoint.
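For context on the traceback above: the default PyTorch model_fn looks for exactly one .pt or .pth file at the top level of the extracted archive, so the layout of each model.tar.gz under the prefix matters. Below is a minimal packaging sketch, illustrative only and not taken from the deployment above; the file names model.pth and code/inference.py are assumptions:
```
import os
import tarfile

def package_model(weights_path="model.pth", code_dir="code", out_path="model.tar.gz"):
    """Pack a single weights file at the archive root, plus the inference code."""
    with tarfile.open(out_path, "w:gz") as tar:
        # The default model_fn requires exactly one .pt/.pth file in the model
        # directory, so the weights file goes at the root of the archive.
        tar.add(weights_path, arcname=os.path.basename(weights_path))
        # The PyTorch serving container looks for the entry point under code/.
        tar.add(code_dir, arcname="code")
    return out_path

# The resulting archive would then be uploaded under model_data_prefix,
# e.g. s3://***-sagemaker/model/model.tar.gz
```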
Hey all,
I am trying to run the script below, written out with %%writefile as "vw_aws_a_bijlageprofile.py". This code has worked for me with other data sources, but now I am getting the following message in the CloudWatch logs:
"2022-08-24T20:09:19.708-05:00 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv"
Any idea how I get around this error?
Full code below.
Thank you in advance!
```
%%writefile vw_aws_a_bijlageprofile.py
import os
import sys
import subprocess

def install(package):
    subprocess.check_call([sys.executable, "-q", "-m", "pip", "install", package])

install('awswrangler')
install('tqdm')
install('pandas')
install('botocore')
install('ruamel.yaml')
install('pandas-profiling')

import awswrangler as wr
import pandas as pd
import numpy as np
import datetime as dt
from dateutil.relativedelta import relativedelta
from string import Template
import gc
import boto3
from pandas_profiling import ProfileReport

client = boto3.client('s3')
session = boto3.Session(region_name="eu-west-2")

def run_profile():
    query = """
    SELECT * FROM "intl-euro-archmcc-database"."vw_aws_a_bijlage"
    ;
    """
    # switch table name above
    tableforprofile = wr.athena.read_sql_query(query,
                                               database="intl-euro-archmcc-database",
                                               boto3_session=session,
                                               ctas_approach=False,
                                               workgroup='DataScientists')
    print("read in the table queried above")
    print("got rid of missing and added a new index")

    profile_tblforprofile = ProfileReport(tableforprofile,
                                          title="Pandas Profiling Report",
                                          minimal=True)
    print("Generated table profile")
    return profile_tblforprofile

if __name__ == '__main__':
    profile_tblforprofile = run_profile()
    print("Generated outputs")

    output_path_tblforprofile = '/opt/ml/processing/output/profile_vw_aws_a_bijlage.html'
    # switch profile name above
    print(output_path_tblforprofile)
    profile_tblforprofile.to_file(output_path_tblforprofile)
```
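On the pip message itself: it is a warning rather than a fatal error, since the processing container runs pip as root. If the goal is simply to keep it out of the logs, one option is to pass pip's --root-user-action flag in the install helper. A sketch, assuming the container ships pip 22.1 or newer (the version that introduced the flag):
```
import subprocess
import sys

def install(package):
    # Suppress the "Running pip as the 'root' user" warning; requires pip >= 22.1.
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "--root-user-action=ignore", package
    ])
```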
```
import boto3
import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput

session = boto3.Session(region_name="eu-west-2")
bucket = 'intl-euro-uk-datascientist-prod'
prefix = 'Mark'

sm_session = sagemaker.Session(boto_session=session, default_bucket=bucket)
sm_session.upload_data(path='vw_aws_a_bijlageprofile.py',
                       bucket=bucket,
                       key_prefix=f'{prefix}/source')
```
```
import boto3
#import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

region = boto3.session.Session().region_name
S3_ROOT_PATH = "s3://{}/{}".format(bucket, prefix)
role = get_execution_role()

sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     sagemaker_session=sm_session,
                                     instance_type='ml.m5.24xlarge',
                                     instance_count=1)
```
```
sklearn_processor.run(code='s3://{}/{}/source/vw_aws_a_bijlageprofile.py'.format(bucket, prefix),
                      inputs=[],
                      outputs=[ProcessingOutput(output_name='output',
                                                source='/opt/ml/processing/output',
                                                destination='s3://intl-euro-uk-datascientist-prod/Mark/IODataProfiles/')])
```
I have an issue getting the CatBoost image URI. sagemaker.image_uris.retrieve is the function for generating ECR image URIs for pre-built SageMaker Docker images. Here is my code: `catboost_container = sagemaker.image_uris.retrieve("catboost", my_region, "latest")`
I am looking for the Docker registry path and example code for the CatBoost algorithm in US East (N. Virginia) (us-east-1), but I could not find it at the following link: https://docs.aws.amazon.com/sagemaker/latest/dg/ecr-us-east-1.html
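For reference, a hedged sketch of retrieving a built-in CatBoost training image through the SDK's JumpStart-style lookup, which uses a model ID rather than a plain framework name; the model ID catboost-classification-model and the other arguments are assumptions and may need adjusting:
```
from sagemaker import image_uris

# Sketch only: the model_id below is an assumption, not taken from the registry-path docs above.
catboost_image_uri = image_uris.retrieve(
    framework=None,
    region="us-east-1",
    model_id="catboost-classification-model",
    model_version="*",
    image_scope="training",
    instance_type="ml.m5.xlarge",
)
print(catboost_image_uri)
```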
I previously ran a hyperparameter tuning job for SageMaker DeepAR with the instance type ml.c5.18xlarge, but it seems insufficient to complete the tuning job within the max_run time specified in my account. Having now tried the accelerated GPU instance ml.g4dn.16xlarge, I get the error: "Instance type ml.g4dn.16xlarge is not supported by algorithm forecasting-deepar."
I cannot find any documentation listing the instance types supported by DeepAR. Which GPU/CPU instances have more compute capacity than ml.c5.18xlarge that I could leverage for my tuning job?
If there aren't any, I would appreciate recommendations on how to speed up the job. I need the tuning job to complete within the max run time of 432000 seconds. Thank you in advance!
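For illustration, a sketch of pointing the training side of the tuning job at a different instance configuration; the choice of ml.p3.2xlarge and the use of two instances are assumptions about what forecasting-deepar accepts, not confirmed from its documentation:
```
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
region = session.boto_region_name

# Resolve the built-in DeepAR image for this region.
deepar_image = image_uris.retrieve("forecasting-deepar", region)

estimator = Estimator(
    image_uri=deepar_image,
    role=sagemaker.get_execution_role(),
    # Assumption: P3 GPU instances and multi-instance training are accepted
    # by forecasting-deepar, unlike ml.g4dn.16xlarge in the error above.
    instance_count=2,
    instance_type="ml.p3.2xlarge",
    sagemaker_session=session,
)
```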
I would like to try out DeepAR for an engineering problem for which I have some sensor datasets, but I am unsure how to set them up for ingestion into DeepAR to get a predictive model.
The data is essentially the positions, orientations, and a few other time-series sensor readings of an assortment of objects (animals, in this case, actually) over time. The data is both noisy and sometimes missing.
So, in this case, there are N individuals, and for each individual there are Z variables of interest. None of the variables are "static" (color, size, etc.); they are all expected to be time-varying on the same time scale.
Ultimately, I would like to try and predict all Z targets for all N individuals.
How do I set up the timeseries to feed into DeepAR?
The premise is that all these individuals are implicitly interacting in the observed space, so all the target values have some interdependence on each other, which is what I would like to see if DeepAR can take into account to make predictions.
Should I be using a category vector of length 2, such that the first cat variable corresponds to the individual and the second corresponds to one of the variables associated with that individual?
Then there would be N*Z targets in my input dataset, each with `cat = [n, z]`, where n takes N distinct values and z takes Z distinct values?
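To make that layout concrete, here is a minimal sketch of writing training data in DeepAR's JSON Lines format under the proposed scheme; the start, target, and cat fields follow the DeepAR input format, while the series dictionary, timestamps, and file name are placeholders:
```
import json

# Placeholder data: series[(n, z)] is the time series of variable z for individual n,
# all assumed to share the same start timestamp and frequency.
series = {
    (0, 0): [1.2, 1.3, None, 1.5],   # missing values can be encoded as null
    (0, 1): [0.4, 0.5, 0.6, 0.7],
    (1, 0): [2.0, 2.1, 2.2, 2.3],
}

with open("train.jsonl", "w") as f:
    for (n, z), target in series.items():
        record = {
            "start": "2022-01-01 00:00:00",
            "target": target,
            "cat": [n, z],   # cat[0] = individual n, cat[1] = variable z, as proposed above
        }
        f.write(json.dumps(record) + "\n")
```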