Splitting a CSV file into 40,000-line chunks

Louis J Scoras wrote:

… the whole set of elements in memory before you can begin sorting them.

In that case, you might want to just shove them in an RDBMS and let it
sort it for you.
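
Something along these lines, for instance. This is a rough, untested sketch using the sqlite3 and fastercsv gems; the file, table, and column names are made up for illustration, and it assumes the CSV has a header row:

  require 'rubygems'
  require 'faster_csv'
  require 'sqlite3'

  db = SQLite3::Database.new('scratch.db')
  db.execute('CREATE TABLE rows (foo TEXT, line TEXT)')

  # Load the CSV into the table; the database keeps on disk what
  # won't fit in RAM.
  db.transaction do
    FasterCSV.foreach('big.csv', :headers => true) do |row|
      db.execute('INSERT INTO rows VALUES (?, ?)', [row['foo'], row.to_csv])
    end
  end

  # Let SQL do the sorting, then write the result out in 40,000-row pages.
  page, chunk = 0, []
  db.execute('SELECT line FROM rows ORDER BY foo') do |(line)|
    chunk << line
    if chunk.size == 40_000
      File.open("page_#{page}.csv", 'w') { |f| f.puts chunk }
      page += 1
      chunk.clear
    end
  end
  File.open("page_#{page}.csv", 'w') { |f| f.puts chunk } unless chunk.empty?

The nice part is that the ORDER BY never requires the whole dataset in Ruby's memory; only one 40,000-row chunk is held at a time.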

Let’s say you want to sort by the foo column:

1. Read in all the foo values and sort them.
2. Take every 40,000th value from the sorted list; these become the page boundaries.
3. Now, upon reading any row, you can determine which page it belongs on.
4. Read the file, keeping the rows for the first N pages and ignoring the rest, where N is a number that won’t run you out of memory.
5. Write out the files for those pages.
6. Remove references to the rows you read in so they can be garbage collected.
7. Repeat with the next N pages until finished (a rough sketch of these steps follows).
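
Roughly, in code. This is an untested sketch of the steps above; it assumes the input is big.csv with a header row, the sort column is named foo, and the fastercsv gem is available:

  require 'rubygems'
  require 'faster_csv'

  PAGE = 40_000

  # Pass 1: collect and sort the foo values, then pick the page boundaries.
  foos = []
  FasterCSV.foreach('big.csv', :headers => true) { |row| foos << row['foo'] }
  foos.sort!
  boundaries = []
  (PAGE...foos.size).step(PAGE) { |i| boundaries << foos[i] }

  # A row's page is the first boundary its foo value sorts below.
  def page_for(foo, boundaries)
    boundaries.each_with_index { |b, i| return i if foo < b }
    boundaries.size   # last page
  end

  # Pass 2, repeated: handle N pages per sweep so memory stays bounded.
  n = 10
  first = 0
  while first <= boundaries.size
    pages = Hash.new { |h, k| h[k] = [] }
    FasterCSV.foreach('big.csv', :headers => true) do |row|
      p = page_for(row['foo'], boundaries)
      pages[p] << row.to_csv if p >= first && p < first + n
    end
    pages.each do |p, lines|
      File.open("page_#{p}.csv", 'w') { |f| f.puts lines }
    end
    pages = nil   # drop the references so the rows can be collected
    first += n
  end

Each page file then holds at most 40,000 rows and is small enough to sort on its own if full ordering within a page is needed.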

Paul L. wrote:

Nice, informative post. There are a lot of issues here, primarily the fact
that the database under discussion is too big to hold in memory, and it is
also too big to fit into Excel in one chunk, which appears to be its
destination.

Most people have begun to drift toward suggesting a database approach,
rather than anything that involves direct manipulation of the database in
Ruby. Because of the size of the database and because sorting the records
is one goal, I have to agree.

I haven’t “begun to drift” – I’ll flat out say, “Use a %^$&%^$(
database!”


M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
http://borasky-research.blogspot.com/

If God had meant for carrots to be eaten cooked, He would have given
rabbits fire.

Thanks everyone for ALL the replies. Lots of interesting things to think
about. I’ll take a look at using a database approach for this, and I’m
looking at FasterCSV now. There was also some very good insight on
building code from scratch and using libraries.

Another great example of the Ruby community at work, IMO.
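
In case it helps anyone who finds this thread later, here is roughly the FasterCSV splitting I’m trying. It's an untested sketch; the file names are placeholders, it assumes a header row, and it repeats the header at the top of each chunk:

  require 'rubygems'
  require 'faster_csv'

  SIZE = 40_000
  part, count, out = 0, 0, nil

  FasterCSV.open('big.csv') do |csv|
    header = csv.shift   # keep the header to repeat in every chunk
    csv.each do |row|
      if count % SIZE == 0          # time to start a new chunk file
        out.close if out
        out = File.open("big_part_#{part += 1}.csv", 'w')
        out.puts header.to_csv
      end
      out.puts row.to_csv
      count += 1
    end
  end
  out.close if out

That gives each chunk 40,000 data rows plus the header line, which keeps every piece under Excel's row limit.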