Being mindful of your spend while experimenting with ML.

4 minute read
Content level: Foundational

Avoid a few common operational errors that could lead to unnecessarily high costs.

Introduction

Amazon SageMaker gives ML practitioners the most comprehensive set of tools covering all aspects of the machine learning workflow. Whether you’re a data scientist, ops engineer, solutions architect or a curiosity-driven ML enthusiast, you will inevitably be experimenting with the tools at your disposal. As always, with great power comes great responsibility, and for the purpose of this article that means being frugal: consuming only what you need, when you need it.

I’m going to share a few examples from my ML journey when I wasn’t mindful of the resources I was leaving behind while experimenting. My hope is that you won’t make the same mistakes. Here they are:

1. SageMaker Canvas costs

Amazon SageMaker Canvas is a visual interface for building and deploying ML models without writing code. When I first came across it, I was excited about its capabilities and eager to put them to the test, so I ran most of the labs in the Immersion Day over the course of several days. Only about a week after finishing my experiments did I realise that my spend was higher than expected. A quick poke around Cost Explorer revealed that Canvas was behind it:

[Screenshot: Cost Explorer showing SageMaker Canvas as the source of the increased spend]
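If you prefer checking from code rather than the console, a similar breakdown can be pulled from the Cost Explorer API. Below is a minimal sketch that sums daily SageMaker cost per usage type. It assumes you pass in a boto3 Cost Explorer client (`boto3.client("ce")`), that your credentials allow `ce:GetCostAndUsage`, and that the service is listed as "Amazon SageMaker" in the SERVICE dimension of your account; the function name is hypothetical.

```python
from collections import defaultdict
from datetime import date


def sagemaker_cost_by_usage_type(ce_client, start: date, end: date) -> dict[str, float]:
    """Sum unblended SageMaker cost per usage type for [start, end).

    `ce_client` is expected to be a boto3 Cost Explorer client.
    Assumption: the service appears as "Amazon SageMaker" in Cost Explorer.
    """
    totals: dict[str, float] = defaultdict(float)
    request = {
        "TimePeriod": {"Start": start.isoformat(), "End": end.isoformat()},  # End is exclusive
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost"],
        "Filter": {"Dimensions": {"Key": "SERVICE", "Values": ["Amazon SageMaker"]}},
        "GroupBy": [{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
    }
    while True:
        response = ce_client.get_cost_and_usage(**request)
        for period in response["ResultsByTime"]:
            for group in period.get("Groups", []):
                usage_type = group["Keys"][0]
                amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
                totals[usage_type] += amount
        # Results can be paginated; keep going until there is no token
        token = response.get("NextPageToken")
        if not token:
            break
        request["NextPageToken"] = token
    # Biggest spenders first, so the culprit is at the top
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))
```

Cost Explorer API requests carry a small per-request charge of their own, so this is something to run occasionally rather than in a tight loop.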

According to the pricing page, a Workspace instance is billed at $1.90/hour. The charges apply only while the instance is active, so the easy way to avoid them when you don’t need Canvas is to log out at the end of each session. A schedule-based automated shutdown solution is also available.
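To see why logging out matters, here is a back-of-the-envelope sketch using the $1.90/hour rate quoted above. The usage pattern (3 hours of lab work a day for 5 days) is a hypothetical scenario, and the rate may differ by region or change over time, so check the pricing page for your own numbers.

```python
# Rate quoted in this article; check the Canvas pricing page for your region.
CANVAS_WORKSPACE_RATE_USD_PER_HOUR = 1.90


def session_cost(hours_active: float, rate_per_hour: float = CANVAS_WORKSPACE_RATE_USD_PER_HOUR) -> float:
    """Estimated cost of keeping a Canvas workspace instance active for `hours_active` hours."""
    if hours_active < 0:
        raise ValueError("hours_active must be non-negative")
    return round(hours_active * rate_per_hour, 2)


# Hypothetical scenario: 3 hours of actual work per day over 5 days,
# compared with never logging out over the same 5 days.
logged_out = session_cost(3 * 5)    # billed only while working
left_active = session_cost(24 * 5)  # billed around the clock
print(f"Logged out after each session: ${logged_out:.2f}")
print(f"Never logged out:              ${left_active:.2f}")
```

In this scenario, 15 billed hours come to $28.50, while 120 hours of an instance nobody is using come to $228.00.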

2. Lingering model endpoints

Machine learning models can be deployed on AWS in many ways. Whichever method you use, one thing stays the same: the endpoints serving them stay online until you explicitly delete them. I learned this the hard way. I was experimenting with this notebook to deploy FLAN-T5 and learn more about fine-tuning, and I didn’t notice that the SageMaker endpoint created by this line of code remained online:

pretrained_predictor = pretrained_model.deploy()

The mistake I made was skipping some lines that do the clean up, particularly this one:

pretrained_predictor.delete_endpoint()
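One way to make this cleanup harder to skip is to tie it to a `with` block, so `delete_endpoint()` runs even if a cell in between raises an error. The context manager below is a hypothetical helper, not part of the SageMaker SDK; it only assumes the object it wraps has a `delete_endpoint()` method, as the predictor returned by `deploy()` does. It does not help if `deploy()` itself fails, so the manual check in the console is still worth doing.

```python
from contextlib import contextmanager
from typing import Iterator, Protocol


class DeletablePredictor(Protocol):
    """Anything with a delete_endpoint() method, e.g. a SageMaker Predictor."""

    def delete_endpoint(self) -> None: ...


@contextmanager
def temporary_endpoint(predictor: DeletablePredictor) -> Iterator[DeletablePredictor]:
    """Yield the predictor, then always delete its endpoint, even if the body raises."""
    try:
        yield predictor
    finally:
        predictor.delete_endpoint()


# Hypothetical usage in the notebook:
# with temporary_endpoint(pretrained_model.deploy()) as pretrained_predictor:
#     pretrained_predictor.predict(...)
# # The endpoint is deleted here, whether or not predict() succeeded.
```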

This could easily have been avoided. Even if you’re confident that you’ve cleaned up properly, it doesn’t hurt to double-check by navigating to SageMaker → Inference → Endpoints and looking for anything that is still online but shouldn’t be. Here’s a screenshot showing the endpoint I created for the purpose of this article:

[Screenshot: an endpoint created for this article, listed under SageMaker → Inference → Endpoints]
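The same double-check can be scripted. The sketch below assumes you pass in a boto3 SageMaker client (`boto3.client("sagemaker")`) and lists endpoints in the InService state that were created more than a given age ago. The function name and the age threshold are illustrative choices, not part of the SageMaker API. It only reads; deleting what it finds stays a deliberate, separate step.

```python
from datetime import datetime, timedelta, timezone


def in_service_endpoints(sm_client, older_than: timedelta = timedelta(0)) -> list[str]:
    """Return names of InService endpoints created more than `older_than` ago.

    `sm_client` is expected to be a boto3 SageMaker client; boto3 returns
    CreationTime as a timezone-aware datetime, so it compares with a UTC cutoff.
    """
    cutoff = datetime.now(timezone.utc) - older_than
    names: list[str] = []
    request = {"StatusEquals": "InService"}
    while True:
        response = sm_client.list_endpoints(**request)
        for summary in response["Endpoints"]:
            if summary["CreationTime"] <= cutoff:
                names.append(summary["EndpointName"])
        # Follow pagination until the last page
        token = response.get("NextToken")
        if not token:
            break
        request["NextToken"] = token
    return names
```

Running it at the end of a session, for example with `older_than=timedelta(hours=1)`, gives a quick list of anything you may have forgotten.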

3. Notebook instances

You can run Jupyter notebooks on SageMaker. They can be launched from the main console, under Notebook → Notebook instances, or from Studio by opening a .ipynb file and choosing the desired instance type. The instance remains online while you work and also after you’re done, even if you close Studio or the AWS console. Here’s an example from my account where an instance with the default hardware (ml.t3.medium) is running:

[Screenshot: a notebook instance running on the default ml.t3.medium hardware]

This instance type costs only $0.05/hour, whereas ml.g4dn.xlarge costs $0.7364/hour, or about $124 a week if left running around the clock (0.7364 × 24 × 7). These charges can be avoided simply by stopping the instance when you’re done with your work. An automated shutdown solution is also available.
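Stopping can also be scripted. A minimal sketch, assuming a boto3 SageMaker client (`boto3.client("sagemaker")`) with permission to list and stop notebook instances: it finds notebook instances in the InService state and stops them only when `dry_run=False`. The helper name is hypothetical. It covers the notebook instances created from the console only; Studio runs its kernels as separate apps, which this does not touch. Stopping keeps the instance’s storage volume, so your files are still there when you start it again.

```python
def stop_in_service_notebooks(sm_client, dry_run: bool = True) -> list[str]:
    """Find InService notebook instances and, unless dry_run, stop them.

    `sm_client` is expected to be a boto3 SageMaker client.
    Returns the names of the instances that were found (and stopped if not dry_run).
    """
    names: list[str] = []
    request = {"StatusEquals": "InService"}
    while True:
        response = sm_client.list_notebook_instances(**request)
        for notebook in response["NotebookInstances"]:
            names.append(notebook["NotebookInstanceName"])
        # Follow pagination until the last page
        token = response.get("NextToken")
        if not token:
            break
        request["NextToken"] = token
    if not dry_run:
        for name in names:
            sm_client.stop_notebook_instance(NotebookInstanceName=name)
    return names
```

Defaulting to a dry run is a deliberate choice: in a shared account, it is worth seeing the list before stopping instances that colleagues may still be using.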

Going the extra mile

Understanding your usage as an individual contributor works well at a relatively small scale. But what if we’re in a larger organisation with projects and workloads spread across different AWS regions? Any existing practices and standards, once multiplied, have an even greater impact on our overall spend. There is a solution that lets us dive into our ML costs at any scale: the Cost and Usage Dashboards Operations Solution (CUDOS), an Amazon QuickSight dashboard with many insightful visuals covering the financial aspects of your AWS usage. In its AI/ML section you can find this graph, which shows the cost of instances associated with SageMaker Studio and notebook instances:

SageMaker Studio and Notebook instances in CUDOS

There’s much more to explore in CUDOS. If you’re curious, a demo dashboard is available, and you can follow the deployment instructions to set it up in your own AWS account.

Conclusion

Being aware of the resources we are consuming, and knowing how to avoid unnecessary charges, matters in the machine learning domain too. Regardless of the scale of our work, responsible experimentation helps us reduce both our costs and our carbon footprint. I hope that sharing these examples helps you avoid a mistake or two. Thank you for reading.