3D grphique: How to avoid "Invalid checkpoint directory" error in Apache Spark?

Friday, 17 April 2015


I'm using Amazon EMR + S3 as my Spark cluster infrastructure. I'm running a job with periodic checkpointing (it has a long dependency tree, so truncating it by checkpointing is mandatory; each checkpoint has 320 partitions). The job stops halfway with the following exception:


(On driver)
org.apache.spark.SparkException: Invalid checkpoint directory: s3n://spooky-checkpoint/9e9dbddf-e5d8-478d-9b69-b5b966126d3c/rdd-198
    at org.apache.spark.rdd.CheckpointRDD.getPartitions(CheckpointRDD.scala:54)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
...
(On Executor)
15/04/17 22:00:14 WARN StorageService: Encountered 4 Internal Server error(s), will retry in 800ms
15/04/17 22:00:15 WARN RestStorageService: Retrying request following error response: PUT '/9e9dbddf-e5d8-478d-9b69-b5b966126d3c/rdd-198/part-00025' -- ResponseCode: 500, ResponseStatus: Internal Server Error
...

After manually checking the checkpointed files, I found that /9e9dbddf-e5d8-478d-9b69-b5b966126d3c/rdd-198/part-00025 is indeed missing on S3. So my questions are: if the file is missing (perhaps due to an AWS malfunction), why didn't Spark detect this immediately during checkpointing, when the write could still be retried, instead of throwing an unrecoverable error once the dependency tree is already lost? And how can I prevent this from happening again?
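For reference, the manual check I did can be automated roughly as follows. This is only a sketch, not part of Spark: `missing_parts` is a hypothetical helper, it assumes part files are named `part-NNNNN` (as in the executor log above), and it assumes the list of keys under the checkpoint directory has already been obtained by some other means (for example an S3 listing tool).

```python
import re

# Matches the trailing "part-NNNNN" component of a key, as seen in the log above.
PART_RE = re.compile(r"part-(\d{5})$")


def missing_parts(listed_keys, num_partitions):
    """Return the expected part-file names that do not appear in listed_keys.

    listed_keys: iterable of key/path strings found under one rdd-* directory.
    num_partitions: number of partitions the checkpoint should contain.
    """
    found = set()
    for key in listed_keys:
        match = PART_RE.search(key)
        if match:
            found.add(int(match.group(1)))
    return ["part-%05d" % i for i in range(num_partitions) if i not in found]


if __name__ == "__main__":
    # Simulated listing: 320 partitions expected, partition 25 was never written.
    keys = ["rdd-198/part-%05d" % i for i in range(320) if i != 25]
    print(missing_parts(keys, 320))
```

Running a check like this right after each checkpoint would at least reveal an incomplete checkpoint directory early, although it does not by itself explain why Spark did not notice the failed write.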
