Sunday, April 13, 2014

Need Help to Reduce Processing Time for Large CSV Data






I've read through some of the previous questions on speeding up the processing of large CSV data. I've implemented some of the ideas and got some improvement in processing time. However, I still need to cut the processing time down further, and hopefully someone can help me.


My code is quite long, so I'll simplify. Here is what my code is supposed to do:

1. Read through a CSV file.

2. Group the data by the first column, calculate the total sum of every other column per group, and return the result.


Example (Raw Data):

A B C

1 2 3

1 2 3

2 4 4

2 4 4


Result:

A B C

1 4 6

2 8 8


Note: My actual data will be a 100 MB file with 630 columns and 29,000 rows, 18.27M values in total.
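To make the grouping concrete, here is a minimal Python sketch of the transformation described above. The post includes no code, so the function name `group_and_sum` and the use of the `csv` module are illustrative, not the original implementation:

```python
import csv
import io
from collections import defaultdict

def group_and_sum(csv_text):
    """Group rows by the first column and sum the remaining columns per group."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # skip the "A B C" header row
    totals = defaultdict(lambda: [0] * (len(header) - 1))
    for row in reader:
        for i, value in enumerate(row[1:]):
            totals[row[0]][i] += int(value)
    return {key: sums for key, sums in sorted(totals.items())}

data = "A,B,C\n1,2,3\n1,2,3\n2,4,4\n2,4,4\n"
print(group_and_sum(data))  # {'1': [4, 6], '2': [8, 8]}
```

This reproduces the example result: group `1` sums to `4, 6` and group `2` to `8, 8`.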


Here is how I achieve it:


Method 1:

1. Read the CSV file through a FileStream.

2. Use Split to split the returned string and process it line by line, field by field.

3. Store the results in an array and save them to a text file.


Note on Method 1: processing the data this way takes ~1 min 20 secs.
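The steps of Method 1 could look roughly like this Python sketch (the original presumably uses .NET, given the FileStream reference; the name `method1` and the details here are hypothetical):

```python
from collections import defaultdict

def method1(path):
    """Single-threaded pass: stream the file, split each line on commas,
    and accumulate per-group column sums (group key = first field)."""
    totals = defaultdict(list)
    with open(path, "r", buffering=1 << 16) as f:
        f.readline()  # skip the header row
        for line in f:
            fields = line.rstrip("\n").split(",")
            sums = totals[fields[0]]
            if not sums:
                sums.extend(0 for _ in fields[1:])  # first row seen for this key
            for i, value in enumerate(fields[1:]):
                sums[i] += int(value)
    return totals  # the caller can format this and save it to a text file
```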


Method 2:

1. Read the CSV file through a FileStream.

2. Feed the data to different threads before processing starts. (For now I feed 100 lines of data to each thread, fixed at 5 threads due to CPU resource constraints.)

3. Use Split to split the returned string and process it line by line, field by field, in each thread.

4. Join the results from all threads, store them in an array, and save the result to a text file.


Note on Method 2: processing the data this way takes ~50 secs.
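The chunk-and-merge idea of Method 2 can be sketched as follows, again in Python with hypothetical names (`sum_chunk`, `method2`). One caveat for this sketch: in CPython, CPU-bound work in threads is serialized by the GIL, so a process pool would be the closer analogue of truly parallel threads:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def sum_chunk(lines):
    """Partial per-group column sums for one batch of CSV lines (no header)."""
    partial = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split(",")
        sums = partial[fields[0]]
        if not sums:
            sums.extend(0 for _ in fields[1:])
        for i, value in enumerate(fields[1:]):
            sums[i] += int(value)
    return partial

def method2(lines, chunk_size=100, workers=5):
    """Split the lines into fixed-size chunks, sum each chunk in a worker
    thread, then merge the partial results into one total."""
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    totals = defaultdict(list)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(sum_chunk, chunks):
            for key, sums in partial.items():
                if not totals[key]:
                    totals[key].extend(sums)
                else:
                    for i, value in enumerate(sums):
                        totals[key][i] += value
    return totals
```

The merge step works because column sums are associative: adding the partial sums of each chunk gives the same totals as a single pass.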


So I got a ~30-sec improvement going from Method 1 to Method 2. I was wondering what I could do to further improve the processing time. I've tried cutting the data into smaller sections, like 100 lines x 100 columns, and processing those, but the processing time became longer instead.


Hopefully someone can help me with this.


Thank you in advance.








