Vote count:
0
I'm using Amazon EMR + S3 as my spark cluster infrastructure. When I'm running a job with periodic checkpointing (it has a long dependency tree, so truncating by checkpointing is mandatory, each checkpoint has 320 partitions). The job stops halfway, resulting an exception:
(On driver)
org.apache.spark.SparkException: Invalid checkpoint directory: s3n://spooky-checkpoint/9e9dbddf-e5d8-478d-9b69-b5b966126d3c/rdd-198
at org.apache.spark.rdd.CheckpointRDD.getPartitions(CheckpointRDD.scala:54)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
...
(On Executor)
15/04/17 22:00:14 WARN StorageService: Encountered 4 Internal Server error(s), will retry in 800ms
15/04/17 22:00:15 WARN RestStorageService: Retrying request following error response: PUT '/9e9dbddf-e5d8-478d-9b69-b5b966126d3c/rdd-198/part-00025' -- ResponseCode: 500, ResponseStatus: Internal Server Error
...
After manually checking checkpointed files I found that /9e9dbddf-e5d8-478d-9b69-b5b966126d3c/rdd-198/part-00025 is indeed missing on S3. So my question is: if it is missing (perhaps due to AWS malfunction), why didn't spark detect it immediately in the checkpointing process (so it can be retried), instead of throwing an irrecoverable error stating that dependency tree is already lost? And how to avoid this situation from happening again?
asked 55 secs ago
How to avoid "Invalid checkpoint directory" error in apache Spark?
Aucun commentaire:
Enregistrer un commentaire