I have a JSON file which is 1 TB in size. Each JSON object contains a text of 500-600 words, and there are 50 million JSON objects.
Now this is what I have to do with this JSON file. I enter 200-300 words and a percentage value into a web page. The web application then reads the entire JSON file, checking whether the entered words appear in each JSON object and what percentage of them is present. If that percentage is higher than the one I entered, the application also keeps track of which words from the input list appear in that JSON object and which are missing.
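For reference, this is roughly the per-object check I mean (a minimal Python sketch of my own; the function name and arguments are just placeholders, not the real application code):

```python
def check_object(object_words, input_words, threshold_pct):
    """Check one JSON object's text against the entered word list.

    Returns (percentage, matched, missing) when the match percentage
    reaches the threshold, otherwise None.
    """
    object_set = set(object_words)
    matched = [w for w in input_words if w in object_set]
    missing = [w for w in input_words if w not in object_set]
    percentage = 100.0 * len(matched) / len(input_words)
    if percentage >= threshold_pct:
        return percentage, matched, missing
    return None
```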
I felt that reading 1 TB per query was too much, so I tried a trick: I converted the text of every JSON object into hashes (each word is represented by a 3-character hash) and saved them into a text file, so that each line of the text file represents the text of one particular JSON object. This text file is 120 GB and has 50 million lines.
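The conversion looks roughly like this (again a sketch of my approach; truncating an MD5 digest to 3 characters and the `text` field name are simplifications for illustration):

```python
import hashlib

def word_hash(word):
    # Map each word to a 3-character hash (truncated MD5 here;
    # the exact hash scheme is an implementation detail).
    return hashlib.md5(word.lower().encode("utf-8")).hexdigest()[:3]

def object_to_line(json_obj):
    # One line of concatenated word hashes per JSON object.
    words = json_obj["text"].split()
    return "".join(word_hash(w) for w in words)

# Usage: write one line per JSON object to the hash file.
# with open("hashes.txt", "w") as out:
#     for obj in json_objects:
#         out.write(object_to_line(obj) + "\n")
```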
My problem is that reading this file and performing the job above is still too slow: it takes hours to complete. Why? Because the application reads every line of this hash file and checks which words are present and which are not, so this "checking" algorithm runs 50 million times per query.
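The slow part is essentially this linear scan (a sketch with assumed names; the real code lives in the web application, but the structure is the same):

```python
import hashlib

def word_hash(word):
    # Same 3-character hash used when building the file (see the sketch above).
    return hashlib.md5(word.lower().encode("utf-8")).hexdigest()[:3]

def scan_hash_file(path, input_words, threshold_pct):
    input_hashes = {word_hash(w) for w in input_words}
    results = []
    with open(path) as f:
        for line_no, line in enumerate(f):
            line = line.rstrip("\n")
            # Each line is the concatenation of 3-character word hashes
            # for one JSON object.
            object_hashes = {line[i:i + 3] for i in range(0, len(line), 3)}
            matched = [h for h in input_hashes if h in object_hashes]
            percentage = 100.0 * len(matched) / len(input_hashes)
            if percentage >= threshold_pct:
                results.append((line_no, percentage))
    return results
```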
Is there any way I can reduce the time of this operation and do it within a few seconds? I know applications in chemistry and genetic medicine do the exact same thing within seconds. I am open to all solutions, whether it is a big data solution, data mining, or a simple fix. Please help.