Wednesday, 1 April 2015

Optimize Nutch performance on a Hadoop cluster






I'm trying to optimize Nutch performance for crawling sites. For now I'm testing performance on a small Hadoop cluster: only two nodes, with 32 GB RAM and an Intel Xeon E3 1245v2 CPU (4 cores / 8 threads). My Nutch config: http://ift.tt/1aijvO1


So, the problem: the fetch jobs are badly balanced. Some reduce tasks get 4k pages to fetch, others get 1kk (1 million) pages. For an example, see this screenshot: http://ift.tt/1Im6TzO Some reduce tasks finished in 10 minutes, but one task has been running for 11 hours and is still going. It acts as a bottleneck: I have 24 reduce tasks, but only one of them is actually working.
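To illustrate the kind of skew described above, here is a minimal sketch. It assumes URLs are assigned to reduce tasks by hashing the host name, which is only an assumption about this setup and has not been confirmed. The host names, page counts, and the `partition_by_host` helper are all hypothetical; this is not Nutch's actual partitioner code.

```python
import zlib
from collections import Counter


def partition_by_host(host: str, num_reducers: int) -> int:
    """Hypothetical partitioner: every page of a host goes to one reducer."""
    return zlib.crc32(host.encode("utf-8")) % num_reducers


def pages_per_reducer(pages_by_host: dict, num_reducers: int) -> list:
    """Sum the page counts that land on each reducer."""
    load = Counter()
    for host, pages in pages_by_host.items():
        load[partition_by_host(host, num_reducers)] += pages
    return [load[i] for i in range(num_reducers)]


# Made-up crawl: one very large site plus 50 small ones.
pages_by_host = {"big-site.example": 1_000_000}
pages_by_host.update({f"small-{i}.example": 4_000 for i in range(50)})

load = pages_per_reducer(pages_by_host, num_reducers=24)
print("busiest reducer:", max(load), "pages")
print("least busy reducer:", min(load), "pages")
```

Under that assumption, whichever reducer receives the large host ends up with at least 1,000,000 pages no matter how many reduce tasks are configured, which would match the picture of one task running for hours while the rest finish quickly.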


Maybe someone can give some useful advice, or links where I can read about this problem.










