Glue on Ray: Is this supposed to work?

AWS Glue for Ray is unusable in its current state, despite entering GA back in June.

According to the documentation, each account has a quota of 50 M-DPUs, and each worker instance accounts for 2 M-DPUs. Ergo, an account should be able to run up to 25 worker instances, but jobs are unable to scale past 5 worker instances. The ray-monitor logs are full of abortive efforts:

	2023-12-01 17:28:05,793 WARNING manta_cluster_manager.py:128 -- Create node failed as compute resource limits were reached.
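
For context, here's roughly how the job is defined (a minimal boto3 sketch; the name, role, script path, and worker count are placeholders rather than my real values):

	import boto3

	glue = boto3.client("glue", region_name="eu-west-1")

	# Glue for Ray jobs use the Z.2X worker type, which the docs count as
	# 2 M-DPUs per worker, so 25 workers should fit inside the 50 M-DPU quota.
	glue.create_job(
	    Name="ray-demo-job",                                  # placeholder
	    Role="arn:aws:iam::123456789012:role/GlueRayRole",    # placeholder
	    WorkerType="Z.2X",
	    NumberOfWorkers=25,                                   # well under the documented quota
	    Command={
	        "Name": "glueray",
	        "Runtime": "Ray2.4",
	        "PythonVersion": "3.9",
	        "ScriptLocation": "s3://my-bucket/scripts/ray_job.py",  # placeholder
	    },
	)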

So that's bad: I can run a maximum of 5 nodes per job, including the head node. What makes this worse is that all jobs time out after 60 minutes. The UI has a drop-down for the job's maximum timeout, which is disabled and set to 480 minutes, yet after 60 minutes jobs fail with the error "reached jobrun maximum timeout".

Trying start-job-run from the CLI confirms that changing the timeout is not supported for Glue for Ray jobs.
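
This is a rough boto3 equivalent of the CLI call I tried (the job name is a placeholder); the Timeout parameter is the part that gets rejected:

	import boto3

	glue = boto3.client("glue", region_name="eu-west-1")

	# Try to raise the run timeout past the 60-minute ceiling. Timeout is in
	# minutes; 480 matches the (disabled) value shown in the console drop-down.
	response = glue.start_job_run(
	    JobName="ray-demo-job",   # placeholder
	    Timeout=480,
	)
	print(response["JobRunId"])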

I really, really want to use Ray for my big data jobs because life is too short for PySpark, but I can't run a job to completion if I can't increase either the timeout or the number of worker nodes. Is anyone using this successfully?

Edit: Amusingly, if I set the number of workers to, say, 16, then the UI does reflect that capacity, and tells me I've used 32 DPU-hours for a job that was unable to use more than 10. I do hope I won't be charged for them all.
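
For what it's worth, the billing arithmetic looks like it's simply requested capacity times duration, with each worker counted as 2 M-DPUs:

	requested_workers = 16
	dpu_per_worker = 2        # each Z.2X worker accounts for 2 M-DPUs
	run_hours = 1             # the run dies at the 60-minute timeout anyway

	billed = requested_workers * dpu_per_worker * run_hours   # 32 DPU-hours reported
	usable = 5 * dpu_per_worker * run_hours                   # 10 DPU-hours actually available
	print(billed, usable)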

asked 5 months ago · 280 views
2 Answers
Accepted Answer

Hi Henry, on which account? I have a multi-account setup. I have tested with both my sandbox account and our datalake account in my company's org. Both continue to experience the same resource limit. This Builder ID is linked to my personal AWS org. I haven't run Glue Ray jobs on any other accounts, I don't think.

Is it possible to speak in a less async, or less public, way?

	2023-12-05T09:10:52.553+00:00	======== Autoscaler status: 2023-12-05 09:10:52.397427 ========
	2023-12-05T09:10:52.553+00:00	Node status
	2023-12-05T09:10:52.553+00:00	---------------------------------------------------------------
	2023-12-05T09:10:52.553+00:00	Healthy: 1 ray.head.default 4 ray.worker.default
	2023-12-05T09:10:52.553+00:00	Pending: (no pending nodes)
	2023-12-05T09:10:52.553+00:00	Recent failures: (no failures)
	2023-12-05T09:10:52.553+00:00	Resources
	2023-12-05T09:10:52.553+00:00	---------------------------------------------------------------
	2023-12-05T09:10:52.553+00:00	Usage: 40.0/40.0 CPU 0B/212.62GiB memory 562.14KiB/93.80GiB object_store_memory
	2023-12-05T09:10:52.553+00:00	Demands: {'CPU': 1.0}: 312+ pending tasks/actors
	2023-12-05T09:10:52.553+00:00	2023-12-05 09:10:52,418 INFO autoscaler.py:1370 -- StandardAutoscaler: Queue 5 new nodes for launch
	2023-12-05T09:10:52.553+00:00	2023-12-05 09:10:52,418 INFO autoscaler.py:466 -- The autoscaler took 0.098 seconds to complete the update iteration.
	2023-12-05T09:10:52.553+00:00	2023-12-05 09:10:52,418 INFO node_launcher.py:166 -- NodeLauncher1: Got 5 nodes to launch.
	2023-12-05T09:10:52.553+00:00	2023-12-05 09:10:52,418 INFO monitor.py:429 -- :event_summary:Adding 5 node(s) of type ray.worker.default.
	2023-12-05T09:10:52.553+00:00	2023-12-05 09:10:52,419 INFO manta_cluster_manager.py:89 -- Creating nodes with config {'ExecutorSizeInDpu': 1}, ...
	2023-12-05T09:10:52.553+00:00	2023-12-05 09:10:52,498 WARNING manta_cluster_manager.py:128 -- Create node failed as compute resource limits were reached
	2023-12-05T09:10:52.803+00:00	2023-12-05 09:10:52,499 WARNING manta_cluster_manager.py:128 -- Create node failed as compute resource limits were reached
	2023-12-05T09:10:52.803+00:00	2023-12-05 09:10:52,581 WARNING manta_cluster_manager.py:128 -- Create node failed as compute resource limits were reached
	2023-12-05T09:10:52.803+00:00	2023-12-05 09:10:52,583 WARNING manta_cluster_manager.py:128 -- Create node failed as compute resource limits were reached
	2023-12-05T09:10:52.803+00:00	2023-12-05 09:10:52,583 INFO manta_cluster_manager.py:131 -- Successfully created 5 executors
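
A quick way to double-check the cap from inside the driver script is Ray's own node listing (a minimal sketch using the standard ray.nodes() API):

	import ray

	# Inside a Glue for Ray job the cluster already exists; just attach to it.
	if not ray.is_initialized():
	    ray.init(address="auto")

	# Count the nodes the autoscaler actually managed to bring up.
	alive = [node for node in ray.nodes() if node["Alive"]]
	print(f"{len(alive)} live Ray nodes (head + workers)")

The status block above already tells the same story (1 head + 4 workers), but this is handy when tailing the monitor logs isn't convenient.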
answered 5 months ago
  • Hi Bob. This has been fixed since 12/8 for your sandbox and datalake accounts in your company's org. We got the account IDs via AWS support tickets. (On 12/4, the fix had not yet captured the correct scope, which is why you experienced the issue again.) We've provided more details via your account team. Thanks!

Hi Bob, thanks for your feedback. It seems the job is hitting some account limits. We are checking. Thanks

Henry
answered 5 months ago
  • Godspeed, Henry! I've seen the same limits on two separate AWS accounts. I did wonder if there was some undocumented soft limit to stop new users accidentally deploying ALL the clusters, but there were no other Glue jobs running concurrently.

  • Hey @Bob, I have reviewed your account limits in the eu-west-1 Region. Could you please rerun the Glue for Ray jobs using the AWS account you used earlier? Thanks

  • Hi @henry, on which account? I have a multi-account setup. I have tested with both my sandbox account and our datalake account in my company's org. Both continue to experience the same resource limit. This Builder ID is linked to my personal AWS org. I haven't run Glue Ray jobs on any other accounts, I don't think.

    Is it possible to speak in a less async, or less public, way?

  • @henry, thanks for the fix. I just want you to know that you're a beautiful person.
