Spark application takes longer than expected in EMR 7

Question

I have a Spark application running on EMR 7 that took 15+ hours; the same job took 9 hours on EMR 6.14. There were no changes to the code or the data volume. One observation is that the application was attempted three times on EMR 7, which would explain the overall delay.

My question is why it was reattempted only on EMR 7 and not on EMR 6.14. I would really appreciate it if someone could help me fix this issue. Is there anything I need to handle specifically in EMR 7?

Accepted Answer

To force the application master (AM) to launch only on core nodes (On-Demand instances), you can enable YARN node labels with the parameters below:

```
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.node-labels.enabled": "true",
      "yarn.node-labels.am.default-node-label-expression": "CORE"
    }
  }
]
```

You can add the above properties when [provisioning the cluster](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps-create-cluster.html) or when you [reconfigure the existing cluster](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps-running-cluster.html). Once they are set, verify them by connecting to the primary instance and running the command below:

**`yarn cluster --list-node-labels`**

The output lists the configured node labels, which lets you confirm that the AM will be placed only on core instances. More details are in this [document](https://repost.aws/knowledge-center/emr-configure-node-labeling-yarn) for your reference.
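If you provision clusters from a script, the same classification can be generated rather than hand-edited. The following is only a minimal sketch: the function name `build_node_label_config` and its `am_label` parameter are made up for illustration, and the property names and the `CORE` value are taken directly from the configuration above.

```python
import json


def build_node_label_config(am_label: str = "CORE") -> list:
    """Build the yarn-site classification shown above.

    `am_label` is a hypothetical parameter; "CORE" is the value
    used in the answer's configuration.
    """
    return [
        {
            "Classification": "yarn-site",
            "Properties": {
                # EMR expects property values as strings, as in the JSON above.
                "yarn.node-labels.enabled": "true",
                "yarn.node-labels.am.default-node-label-expression": am_label,
            },
        }
    ]


if __name__ == "__main__":
    # Print the JSON so it can be redirected to a file of your choice.
    print(json.dumps(build_node_label_config(), indent=2))
```

The printed JSON can be saved to a file and, for example, passed to the AWS CLI through the `--configurations file://<your-file>.json` option of `aws emr create-cluster`.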

Answer

Hello,

The application master may have been launched on a core or task node that became unhealthy or was decommissioned during execution, or there may have been a resource bottleneck on the instance where the AM was launched. Either could have caused the three attempts on EMR 7.0.

Please check where the application master container was launched by looking at the driver log. If you submitted the application as an EMR step, check the step's stderr log and find the "Application master host name". Then check the resource utilization of that node and investigate whether it was terminated for any reason.

It could also be the case that you chose Spot Instances for the task nodes and the AM was launched on a Spot Instance that was then terminated because of a Spot capacity interruption. In both EMR 6.x and 7.x, the AM can be launched on either core or task nodes. If it was hosted on a task or core node of Spot type, it could have been lost to a Spot termination.
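If the stderr log is large, a small script can pull out the AM hosts for you. This is a rough sketch, assuming the log contains the application report lines that Spark's YARN client prints in the form `ApplicationMaster host: <hostname>` (with `N/A` before the AM is allocated). The function name `am_hosts` and the sample hostnames are hypothetical.

```python
import re

# Matches report lines such as "ApplicationMaster host: ip-10-0-1-15.ec2.internal"
AM_HOST_RE = re.compile(r"ApplicationMaster host:\s*(\S+)")


def am_hosts(log_text: str) -> list:
    """Return the distinct AM hosts in order of first appearance.

    "N/A" entries (printed before the AM is allocated) are skipped.
    More than one host in the result suggests the AM was relaunched
    on a different node.
    """
    hosts = []
    for match in AM_HOST_RE.finditer(log_text):
        host = match.group(1)
        if host != "N/A" and host not in hosts:
            hosts.append(host)
    return hosts


if __name__ == "__main__":
    # Illustrative log excerpt; the hostnames are made up.
    sample = (
        "\t ApplicationMaster host: N/A\n"
        "\t ApplicationMaster host: ip-10-0-1-15.ec2.internal\n"
        "\t ApplicationMaster host: ip-10-0-2-40.ec2.internal\n"
    )
    print(am_hosts(sample))
```

Each hostname returned can then be matched against the cluster's instance list to see whether it was a core or task node and whether it was a Spot Instance.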

Answer

Thanks Yokesh. I see that the application master was hosted on one of the task nodes, which is a Spot Instance, and that led to termination twice. In EMR 6.14 the AM was launched on a core node, but not in EMR 7.

How can I make sure the application master runs on an On-Demand node? In my cluster, the core nodes are On-Demand and the task nodes are Spot.
