No containers running, but instance still not decommissioned

Hi Team, we have an EMR 6.10 cluster where Flink jobs are submitted to an existing application. In my case, the container was running on a task node. I then resized the task instance group from 1 to 0. According to the YARN RM UI, the running job completed within a couple of minutes and no containers were left running, but the node stayed in the decommissioning state for 1 hour before it was removed.

I can configure yarn.resourcemanager.nodemanager-graceful-decommission-timeout-secs, but per the EMR documentation, "If there is no task running EMR remove the instance as per scaling policy", so something seems to be wrong here. Any leads are highly appreciated. Thanks in advance.

Scott M
Asked 2 months ago · 310 views
1 Answer

Accepted Answer

Hello,

Hadoop 3.3.3 introduced a change in YARN (YARN-9608) that keeps nodes where containers ran in a decommissioning state until the application completes. This ensures that local data such as shuffle data isn't lost, so you don't need to re-run the job. However, it can also lead to underutilization of resources on clusters with or without managed scaling enabled.

With Amazon EMR releases 6.11.0 and higher as well as 6.8.1, 6.9.1, and 6.10.1, the value of yarn.resourcemanager.decommissioning-nodes-watcher.wait-for-applications is set to false in yarn-site.xml to resolve this issue.
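As a quick way to check which side of that line a cluster falls on, here is a minimal sketch that encodes only the releases listed above. The function names and the `emr-x.y.z` label parsing are my own illustration, not an AWS API, and patch releases not named in this answer are deliberately treated as unknown (reported as not fixed).

```python
# Releases named in the answer as shipping with wait-for-applications=false.
FIXED_PATCH_RELEASES = {(6, 8, 1), (6, 9, 1), (6, 10, 1)}


def parse_release(label: str) -> tuple[int, int, int]:
    """Turn a label such as 'emr-6.10.1' (or '6.10.1') into (6, 10, 1)."""
    major, minor, patch = (int(p) for p in label.removeprefix("emr-").split("."))
    return (major, minor, patch)


def ships_with_fix(label: str) -> bool:
    """True if the release is 6.11.0 or higher, or one of the listed patch releases."""
    version = parse_release(label)
    return version >= (6, 11, 0) or version in FIXED_PATCH_RELEASES


for label in ("emr-6.10.0", "emr-6.10.1", "emr-6.11.0"):
    print(label, ships_with_fix(label))
```

For the cluster in the question (EMR 6.10, assuming that means 6.10.0), this reports that the default is not changed, which matches the behavior described.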

yarn.resourcemanager.decommissioning-nodes-watcher.wait-for-applications - If true (the default), the resource manager waits for all containers, as well as all applications associated with those containers, to finish before gracefully decommissioning a node.

If false, the resource manager only waits for containers, but not applications, to finish. For map-only jobs or other jobs in which mappers do not need to serve shuffle data, this allows nodes to be decommissioned as soon as their containers are finished as opposed to when the job is done.
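On a release that does not already set this, one option is to override the property yourself through the yarn-site configuration classification. The snippet below is only a sketch that builds that configuration as JSON; the helper name `yarn_site_override` is hypothetical, and how you apply the result (at cluster creation or by reconfiguring a running cluster) depends on your setup, so please verify it in a test cluster first.

```python
import json

WAIT_FOR_APPS = "yarn.resourcemanager.decommissioning-nodes-watcher.wait-for-applications"


def yarn_site_override(wait_for_applications: bool) -> list[dict]:
    """Build an EMR configuration list that sets the property in yarn-site.xml."""
    return [
        {
            "Classification": "yarn-site",
            "Properties": {
                # EMR configuration property values are passed as strings.
                WAIT_FOR_APPS: str(wait_for_applications).lower(),
            },
        }
    ]


# Print JSON that can be pasted into the cluster's software configuration.
print(json.dumps(yarn_site_override(False), indent=2))
```

Keep in mind the trade-off mentioned earlier: with the value set to false, shuffle data on a removed node is no longer protected, so this fits best for jobs that do not need to serve local data after their containers finish.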

yarn.resourcemanager.decommissioning-nodes-watcher.wait-for-app-masters - If false, during graceful decommission, when the resource manager waits for all containers on a node to finish, it will not wait for app master containers to finish. Defaults to true. This property should only be set to false if app master failure is recoverable.
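To make the interaction of the two properties concrete, here is a toy model of the decision as described above. It is not YARN's actual implementation; `NodeState` and `ready_to_decommission` are hypothetical names, and the model ignores the graceful decommission timeout.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    running_task_containers: int        # non-AM containers still running on the node
    running_app_master_containers: int  # app master containers still running on the node
    unfinished_apps_that_ran_here: int  # apps that ran containers here and are not yet complete


def ready_to_decommission(
    node: NodeState,
    wait_for_applications: bool = True,
    wait_for_app_masters: bool = True,
) -> bool:
    """Simplified model: can the node leave the decommissioning state now?"""
    if node.running_task_containers > 0:
        return False
    if wait_for_app_masters and node.running_app_master_containers > 0:
        return False
    if wait_for_applications and node.unfinished_apps_that_ran_here > 0:
        return False
    return True


# The situation in the question: no containers left on the task node, but the
# long-running Flink application that used the node is still active.
node = NodeState(0, 0, 1)
print(ready_to_decommission(node))                               # default settings
print(ready_to_decommission(node, wait_for_applications=False))  # fixed releases
```

Under this model the node in the question stays in the decommissioning state with the defaults and is released as soon as its containers finish when wait-for-applications is false.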

References:

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-6101-release.html

AWS Support Engineer
Answered 2 months ago
Expert
Reviewed 2 months ago
  • Excellent, thank you so much for this information!
