I have two files with records and I want to do the following on Hadoop:
(Easy part)
For each record in both files:
    compute some values from the record and store them in an array representing the record

Then (the messy part)
For each record array computed in the previous step from fileA:
    For each record array computed in the previous step from fileB:
        if they have X number of elements in common:
            print to output
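For the easy part, the rough sketch below is how I picture the streaming mapper. compute_values() is just a placeholder for my real per-record computation, and I tag each record with its source file using the map input file environment variable (the exact name depends on the Hadoop version) so a later step can tell fileA records from fileB records:

#!/usr/bin/env python
# mapper.py - sketch of the "easy part" as a Hadoop streaming mapper.
import os
import sys

def compute_values(record):
    # Placeholder for the real per-record computation; here it just
    # returns the distinct whitespace-separated tokens of the record.
    return sorted(set(record.split()))

# Hadoop streaming exports the current input file name as an environment
# variable (map_input_file on older versions, mapreduce_map_input_file
# on newer ones), which lets each record be tagged with its source file.
source = os.environ.get('mapreduce_map_input_file',
                        os.environ.get('map_input_file', 'unknown'))

for line in sys.stdin:
    record = line.strip()
    if not record:
        continue
    values = compute_values(record)
    # Emit: source-file <TAB> comma-separated value array
    print('%s\t%s' % (source, ','.join(values)))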
That is what I am trying to do with Hadoop; however, I have no idea how to do it efficiently without using a single reducer for the nested for loop.
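To make that concrete, the only version I can picture right now is one reducer that buffers every array from both files and runs the full nested comparison in memory, roughly like the sketch below (X is a hard-coded placeholder threshold, the input format is the mapper output above, and I assume the fileA paths contain the string "fileA"):

#!/usr/bin/env python
# reducer.py - the naive single-reducer version I am trying to avoid:
# it buffers every record array from both files and then runs the full
# nested comparison in memory.
import sys

X = 3  # placeholder for the required number of common elements

arrays_a = []
arrays_b = []

for line in sys.stdin:
    source, values = line.rstrip('\n').split('\t', 1)
    value_set = set(values.split(','))
    # Assumes fileA input paths contain the string "fileA"
    if 'fileA' in source:
        arrays_a.append(value_set)
    else:
        arrays_b.append(value_set)

# The O(|A| * |B|) nested loop - the part that will not scale
for vals_a in arrays_a:
    for vals_b in arrays_b:
        if len(vals_a & vals_b) >= X:
            print('%s\t%s' % (','.join(sorted(vals_a)),
                              ','.join(sorted(vals_b))))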
Any suggestions or ideas on how best to go about such a task?
I would prefer to use Python with the Hadoop streaming jar, as in the sketches above.
Thanks