Sunday, April 20, 2014

Must Hadoop have input files? Can I load local files as inputs directly?






Background: My program was developed in a stand-alone environment using C++, and now I need to deploy it on Hadoop, so I chose the Hadoop Streaming technology (Hadoop Pipes would also work). The inputs are some very large binary files captured by Wireshark (pcap packet format), on which I need to do flow analysis. If I open one of the files in a hex viewer, it looks like this:



D4 C3 B2 A1 02 00 04 00 00 00 00 00 00 00 00 00
FF FF 00 00 01 00 00 00 E9 EA 9F 4F A9 1C 07 00
59 00 00 00 59 00 00 00 94 0C 6D 7A AD 3A 00 0C
29 F5 60 C8 08 00 45 00 00 4B FA BB 00 00 80 11
A9 61 C0 A8 30 69 77 93 2D E0 0F A0 1F 40 00 37
0C 70 02 21 07 00 58 28 98 22 5C B0 03 02 00 00
00 01 01 01 00 00 64 86 2B A2 02 8A 1A 68 EA 92
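
The leading bytes above are recognizable pcap structure: the first 24 bytes are the pcap global header (magic number D4 C3 B2 A1, format version 2.4, snaplen 65535, link type 1 = Ethernet), and each captured packet is preceded by a 16-byte record header (here incl_len = 0x59 = 89 bytes). As a minimal sketch (plain C++, assuming a little-endian host so the structs can be read from disk directly; not part of the original question), a stand-alone walker of this layout could look like:

#include <cstdio>
#include <cstdint>

// pcap global header: the first 24 bytes of the file.
struct PcapFileHeader {
    uint32_t magic;          // 0xA1B2C3D4, stored as D4 C3 B2 A1 on disk
    uint16_t version_major;  // 2
    uint16_t version_minor;  // 4
    int32_t  thiszone;       // GMT-to-local time offset
    uint32_t sigfigs;        // timestamp accuracy
    uint32_t snaplen;        // max bytes captured per packet (0xFFFF here)
    uint32_t network;        // link-layer type (1 = Ethernet)
};

// pcap record header: 16 bytes before each captured packet.
struct PcapRecordHeader {
    uint32_t ts_sec;    // capture time, seconds
    uint32_t ts_usec;   // capture time, microseconds
    uint32_t incl_len;  // bytes saved in the file for this packet
    uint32_t orig_len;  // original packet length on the wire
};

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    FILE* f = std::fopen(argv[1], "rb");
    if (!f) return 1;

    PcapFileHeader fh;
    if (std::fread(&fh, sizeof fh, 1, f) != 1) return 1;
    std::printf("magic=%08x snaplen=%u linktype=%u\n",
                static_cast<unsigned>(fh.magic),
                static_cast<unsigned>(fh.snaplen),
                static_cast<unsigned>(fh.network));

    // Walk the records, skipping each packet payload.
    PcapRecordHeader rh;
    long packets = 0;
    while (std::fread(&rh, sizeof rh, 1, f) == 1) {
        ++packets;
        std::fseek(f, static_cast<long>(rh.incl_len), SEEK_CUR);
    }
    std::printf("packets=%ld\n", packets);
    std::fclose(f);
    return 0;
}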


Problem: On Hadoop, the default input format is TextInputFormat, but I need to handle binary files. As far as I know, Hadoop can also handle binary files through SequenceFileInputFormat, but I don't know what the key/value pairs would be in that case. I have another idea: give Hadoop Streaming no real input files, but ship the local binary stream files to the tasks (using the -files option) and then call fopen("stream file") in the mapper program to open each stream file for processing. That way I wouldn't have to change my stand-alone program much, but my questions are: is each slave node still useful? Will the master node assign such tasks evenly across the slave nodes (will the processing be distributed over each slave node)? It doesn't feel like the Hadoop way. Is there a better idea? Thanks in advance.
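
To make the -files idea concrete, here is a minimal sketch of what such a Streaming mapper could look like (not from the original question; the file name capture.pcap and the analysis stub are hypothetical). A file passed with -files is placed in each task's working directory under its base name, which is what the fopen call below relies on:

// Hadoop Streaming mapper sketch for the -files idea.
// Assumes a launch along the lines of:
//   hadoop jar hadoop-streaming.jar -files /local/capture.pcap \
//       -input dummy.txt -output out -mapper ./pcap_mapper
// (all names hypothetical).
#include <cstdio>

int main() {
    // Drain the dummy text input that Streaming pipes to stdin; in this
    // scheme the real data comes from the shipped file, not the splits.
    char line[4096];
    while (std::fgets(line, sizeof line, stdin)) { /* ignored */ }

    // Open the file shipped via -files from the task's working directory.
    FILE* f = std::fopen("capture.pcap", "rb");  // hypothetical name
    if (!f) return 1;

    // ... run the existing stand-alone flow analysis on f here ...
    // Emit results as tab-separated key/value lines for the shuffle, e.g.:
    // std::printf("%s\t%lu\n", flow_id, byte_count);

    std::fclose(f);
    return 0;
}

One caveat worth noting about this design: -files distributes the same whole file to every task, so Hadoop does not split the pcap data across mappers; parallelism here would have to be arranged manually, for example by having many pcap files and a text input listing one file path per line, so each mapper processes the paths in its own split.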








