Splitting a CSV file into 40,000 line chunks

On Thu, 2006-11-30 at 04:44 +0900, Drew O. wrote:

I also have cygwin installed on my box, so if there is a simple way to do this in unix I’d love to know. My unix is rusty (that’s being generous).

rthompso@jhereg:~$ cat splitandhead
#!/bin/bash

fname=$1
split -l 4000 -a 6 -d $fname $fname

for file in ${fname}0*
do
sed -e '1i\Col1 Col2 Col3 Col4' $file > ${file}.tmp
mv ${file}.tmp $file
done

rthompso@jhereg:~$ head testsed
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
this is a test
rthompso@jhereg:~$ wc testsed
40140 160560 602100 testsed
rthompso@jhereg:~$ time ./splitandhead testsed

real 0m0.499s
user 0m0.140s
sys 0m0.092s
rthompso@jhereg:~$ head -2 testsed0000*
==> testsed000000 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000001 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000002 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000003 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000004 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000005 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000006 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000007 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000008 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000009 <==
Col1 Col2 Col3 Col4
this is a test

==> testsed000010 <==
Col1 Col2 Col3 Col4
this is a test
rthompso@jhereg:~$

James Edward G. II wrote:

/ …

A plain-text CSV file uses linefeeds as record delimiters. A
program that
uses “readline” or “gets” splits the records just as a sane CSV parser
would. And IMHO a CSV file should never, ever have linefeeds
embedded in
fields.

Your opinion doesn’t make you right on this one.

I didn’t make a statement of fact, so the term and concept of “right” is not appropriate. I used the term “should”. And I acknowledged that I was wrong later in the same post.

/ …

I know you’re a don’t-use-a-library guy and you know I disagree.

First, that characterization (“don’t-use-a-library guy”) is false. Second, this isn’t about libraries versus hand-written code, it is about reliable, robust file protocols. IMHO allowing linefeeds into CSV fields is plain stupid, and this is not about right and wrong, it is about robustness.

In any case, the majority of the responders in this thread tried to use
simple linefeed parsing in their replies, as I did, with the same risk.

This is the reason why. The edge cases will get you every time.

The implication is that a failure to use a library will get you every time. This violates common sense in the most fundamental way, for a reason you should certainly be able to detect.

Drew O. wrote:

     xyz_part_2.csv

counter+=1
writer.close()

I will ignore the CSV issue, not because it isn’t important, but simply because I’m not familiar with the csv parser, and this example sufficiently represents the concept.

For maximum elegance, I would write the code this way. It uses the
helper methods (i.e. not in the stdlib) File#write_fresh, File#to_a and
Enumerable#chunks, all of which I’ve written at one time or another.
Mentally sub in the appropriate code as desired.

File.to_a('xyz.csv').chunks(40000).each_with_index do |chunk, i|
  File.write_fresh("xyz_part_#{i+1}", chunk.join("\n"))
end

File.to_a returns an array of lines.
Enumerable#chunks divides an Enumerable into groups of 40k; a 100k array would yield two 40k chunks and a 20k chunk.
File#write_fresh creates the file if it doesn’t exist, truncates any existing file, and writes the 2nd argument to the file.
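
One possible way to write those helpers, as a minimal sketch (these are not stdlib methods; the bodies below are just one plausible implementation of the behavior described above):

class File
  # Return the file's lines as an array, with trailing newlines removed.
  def self.to_a(fname)
    File.open(fname, "r") { |f| f.readlines.map { |line| line.chomp } }
  end

  # Create fname if needed, truncate any existing content, write data.
  def self.write_fresh(fname, data)
    File.open(fname, "w") { |f| f.write(data) }
  end
end

module Enumerable
  # Divide the collection into groups of at most size elements.
  def chunks(size)
    result = [[]]
    each do |item|
      result << [] if result.last.size == size
      result.last << item
    end
    result.last.empty? ? result[0..-2] : result
  end
end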

This version is not only prettier than the corresponding version without the helper methods, it is also clearer. It is obvious at a glance what it does. The same can’t be said for the version without helper methods.

On Nov 29, 2006, at 5:25 PM, Edwin F. wrote:

I am perplexed by CSV.open.

Me too. :wink:

I wonder why it was not patterned more closely on IO? Any thoughts?

The author does not believe CSV objects should be IO like. If you
search the archives for my messages about including FasterCSV in the
standard library you will run into our debates on this subject.

FasterCSV’s open() behaves the way you expect.

James Edward G. II

I am perplexed by CSV.open. In IO and File, open returns something that
quacks like an IO object. You can then call gets, puts, read, write and
so on. The CSV open seems to return an array (or gives you a row at a
time). This makes it very hard to write code that can take advantage of
duck typing, e.g.

def get_batch(f, count)
lines = nil
count.times do
break unless line = f.gets
(lines ||= []) << line
end
lines
end

get_batch(File.open("foo"), 40000)      # No problem
get_batch(CSV.open("foo", "r"), 40000)  # Error

I wonder why it was not patterned more closely on IO? Any thoughts?

M. Edward (Ed) Borasky wrote:

Paul L. wrote:

James Edward G. II wrote:

[snip]

Hmmmm … seems like a bunch of folks are going to a bunch of trouble to baby a tool that can only handle a CSV file with fewer than 65536 rows. Excel is broken … Excel sucks at math … Excel can’t handle lots of rows and lots of columns … Excel costs money. Excel was a great wonder in 1986 – 20 years later it’s lame beyond words.

I concur in all respects, based on direct experience.

I’m not sure about the column count limitations, but I know almost any
database will deal with a CSV file with more than 65536 rows. There are
some awfully good free databases. Did I mention how bad Excel is at
math?

It certainly bears repeating. I’ve found a number of math errors in Excel over the years.

Sometimes you don’t get what you pay for. :slight_smile:

IMHO Microsoft represents a negative correlation across the board.

Paul L. wrote:

James Edward G. II wrote:

[snip]

Hmmmm … seems like a bunch of folks are going to a bunch of trouble to baby a tool that can only handle a CSV file with fewer than 65536 rows. Excel is broken … Excel sucks at math … Excel can’t handle lots of rows and lots of columns … Excel costs money. Excel was a great wonder in 1986 – 20 years later it’s lame beyond words.

I’m not sure about the column count limitations, but I know almost any
database will deal with a CSV file with more than 65536 rows. There are
some awfully good free databases. Did I mention how bad Excel is at
math? Sometimes you don’t get what you pay for. :slight_smile:


M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
http://borasky-research.blogspot.com/

If God had meant for carrots to be eaten cooked, He would have given
rabbits fire.

Edwin F. wrote:

I am perplexed by CSV.open. In IO and File, open returns something that
quacks like an IO object. You can then call gets, puts, read, write and
so on. The CSV open seems to return an array (or gives you a row at a
time).

/ …

I wonder why it was not patterned more closely on IO? Any thoughts?

This is an experience with which I am becoming familiar. Someone requests a solution to a problem. Someone else offers the option of a library to solve the problem. Then the original problem fades into the background, replaced by discussion of the library’s problems.

This same pattern has repeated itself about four times in the past
fortnight, in just this one newsgroup.

I can be relied on to suggest a terse code solution. Then someone else can be relied on to point out, correctly, that a terse code solution may miss edge cases and, if exposed to enough data, will surely fail. Absolutely correct.

So, just for variety, I will not say “Have you considered writing your own code?” just because that’s what people expect me to say. I won’t do this because I now realize “code” is a trigger word, just like saying “abortion” among fundamentalists – that is something you just don’t want to do.

So I will say “Have you considered writing your own library?” It amounts to the same thing, since libraries are written using code, and all code is written by mortals, but this way of saying it avoids the trigger word “code”.

Your own code … er, excuse me, your own library … will meet your requirements exactly, it won’t cover cases that are not relevant to the problem at hand, it will be much faster overall than existing solutions, and you will learn things about Ruby that you would not if you used someone else’s library.

In this specific case, as has been pointed out to me, a CSV field can contain linefeeds, which means – if your data exploits this trait – you need to parse the entire database using a state machine that knows about this possibility.

On the other hand, if your data does not exploit this CSV trait (few real-world CSV databases embed linefeeds), you can scan the data much more quickly using a simpler solution, but a solution that will certainly fail if the above assumption turns out to be false. Code like this:


#!/usr/bin/ruby -w

max_output_lines = 40000

input_file = "test.txt"

output_base = "output"

n = 0

ifile = File.open(input_file, "r")

header = ifile.gets

until ifile.eof?
  ofn = output_base + sprintf("%03d", n) + ".txt"
  ofile = File.open(ofn, "w")
  ofile.write(header)
  line = 2
  until ifile.eof? || line > max_output_lines
    ofile.write(ifile.gets)
    line += 1
  end
  ofile.close
  n += 1
end

ifile.close


Note that I meet your requirement to place the original header line at the top of each database section.

If you will call this a “library”, it will pass muster with those who prefer the word “library” to the word “code”. Outside the box, it’s all the same. Inside the box, not at all.

On Nov 30, 2006, at 12:36 PM, Paul L. wrote:

Your own code … er, excuse me, your own library … will meet your requirements exactly, it won’t cover cases that are not relevant to the problem at hand, it will be much faster overall than existing solutions, and you will learn things about Ruby that you would not if you used someone else’s library.

Now you’re guilty of a new sin: encouraging people to reinvent the
wheel. You just can’t win, can you? :wink:

Different problems require different levels of paranoia. Sometimes a
little code will get you over the hump, but you may be making some
trade-offs when you don’t use a robust library. Sometimes those are
even good trade-offs, like sacrificing edge case handling to gain
some speed. Sometimes it’s even part of the goal to avoid the
library, like when I built FasterCSV to address some needs CSV wasn’t
meeting. As soon as things start getting serious though, I usually
feel safer reaching for the library.

The people reading this list have seen us debate the issue now and will be able to make well-informed decisions about what they think is best.

On the other hand, if your data does not exploit this CSV trait (few
real-world CSV databases embed linefeeds)…

Really? How do they handle data with newlines in it?

Which “CSV databases” are you referring to here?

James Edward G. II

On 11/29/06, James Edward G. II [email protected] wrote:

Perhaps you can collect just the key values in an Array and then use
those to reorder the lines bit by bit. That’s not going to be fast
with any library helping you, but I don’t have a better idea.

James Edward G. II

Indeed. That problem is difficult in general because you need to have
the whole set of elements in memory before you can begin sorting them.
As James pointed out, you might be able to use some sort of memoization technique to track only the bits relevant to sorting. The problem is you’ll also need some way to get back to the original record.

Depending on how you end up parsing the records, you might be able to store the file position of the start of the record and the record length.

Records → [sort_key, file.pos, record.length]

Then sort those arrays if you can fit them all in memory. Finally,
you can use the offsets for random access to grab the records and
stick them into the new files as you’ve been doing.

Basically, you’re looking at a complicated Schwartzian transform. Whether it will work depends on how big your records are. If they are fairly large, you might be able to pull it off; however, if they’re small and the problem is only that there are too many records, you’ll still have a problem.

In that case, you might want to just shove them in an RDBMS and let it
sort it for you.
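
A rough sketch of that offset-based approach, with hypothetical file names; it assumes one record per line (no embedded linefeeds) and that the sort key is the 4th comma-separated field:

input      = "xyz.csv"
chunk_size = 40000
index      = []      # holds only [key, offset, length] triples

File.open(input, "r") do |f|
  header = f.gets
  pos = f.pos
  while line = f.gets
    index << [line.split(",")[3], pos, line.length]
    pos = f.pos
  end

  index.sort!   # Array#<=> compares element by element, so this sorts by key

  n = 0
  until index.empty?
    chunk = index.slice!(0, chunk_size)
    File.open(sprintf("xyz_part_%03d.csv", n), "w") do |ofile|
      ofile.write(header)
      chunk.each do |key, offset, length|
        f.seek(offset)                  # random access back into the source
        ofile.write(f.read(length))
      end
    end
    n += 1
  end
end

If even the [key, offset, length] index is too big for memory, that is the point where the RDBMS suggestion takes over.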

James Edward G. II wrote:

Now you’re guilty of a new sin: encouraging people to reinvent the
wheel. You just can’t win, can you? :wink:

If the OP has a problem not easily solved with a library, then he isn’t
reinventing the wheel. And I don’t care about winning.

Different problems require different levels of paranoia.

Yes, absolutely. The larger the job and the larger the data set, the more likely one will encounter border conditions, and the more appropriate it is to use a state machine that understands the full specification. All at the cost of speed.

Sometimes a
little code will get you over the hump, but you may be making some
trade-offs when you don’t use a robust library.

Yes, but a robust library is not appropriate if it cannot solve the problem, or if the learning curve is so steep that it would be easier to write one’s own scanner.

Also, there is a hidden assumption in your position – that libraries, ipso facto, represent robust methods.

Sometimes those are
even good trade-offs, like sacrificing edge case handling to gain
some speed. Sometimes it’s even part of the goal to avoid the
library, like when I built FasterCSV to address some needs CSV wasn’t
meeting.

That borders on the heretical. :slight_smile:

As soon as things start getting serious though, I usually
feel safer reaching for the library.

I’ve noticed that. I want to emphasize once again that my style is a
personal preference, not an appeal to authority or untestable precepts.

The people reading this list have seen us debate the issue now and will be able to make well-informed decisions about what they think is best.

I think 90% of the readers of this newsgroup won’t pay any attention to either of our opinions on this topic. They will realize that inside every library is code written by a mortal human, therefore this sort of debate is primarily tilting at windmills or describing angel occupancy requirements for heads of pins.

For the newbies, however, it might matter. They might think library contents differ from ordinary code. And that is true only if the writers of libraries differ from ordinary coders. Ultimately, they don’t, as Microsoft keeps finding out.

On the other hand, if your data does not exploit this CSV trait (few
real-world CSV databases embed linefeeds)…

Really? How do they handle data with newlines in it?

Linefeeds are escaped as though in a normal quoted string. This is how I have always dealt with embedded linefeeds, which is why I was ignorant of the specification’s language on this (an explanation, not an excuse).

Which “CSV databases” are you referring to here?

MySQL, the database I am most familiar with, uses this method for import or export of comma- or tab-separated plain-text data. Within MySQL’s own database protocol, linefeeds really are linefeeds, but an imported or exported plain-text table has them escaped within fields.

I create a lot of plain-text databases, and I am constantly presenting them to MySQL for parsing (or getting plain-text back from MySQL), and this only confirmed my mistaken impression that linefeeds are always escaped in fields of this class of database.

It’s obvious why the specification reads as it does, and I should have known about this long ago. It reads as it does because it just isn’t that difficult to parse a quoted field, and it is no big deal to allow absolutely anything in the field. It just takes longer if all the database handling (not just record parsing) must use the same state machine that field parsing must use.

It’s very simple, really. Once you allow the record separator inside a
field, you give up any chance to parse records quickly.

When a group of people sit down to create a specification, the highest priority is … utility, common sense? … no, it’s immunity to criticism. The easiest way to avoid criticism is to allow absolutely anything, even if this hurts performance in real-world embodiments that obey the specification.

Someone might say, “Okay, but can you drop an entire, internally consistent CSV database into the boundaries of a single field of another CSV database, without any harm or lost data?” Using the present specification, the committee can say “yes, absolutely.”

But parsing will necessarily be slow, character by character, the entire database scan must use an intelligent parser (no splitting records on linefeeds as I have been doing), and the state machine needs a few extra states.

I cannot tell you how many times I have foolishly said, “surely the
specification doesn’t allow that!”, and I cannot remember actually ever
being right after taking such a position. When I make assumptions about
committees, I am always wrong.

2006/11/30, Drew O. [email protected]:

I’ll give FasterCSV a try when I get home from work and out from behind
this proxy. Here’s another question: in some cases I need to sort the
file before splitting it (in this case sorting by the 4th cell in each
row). However, the current file I’m trying to sort and split is around
76 MB and ruby fails when trying to store the CSV as an array. The code
and output are below. How else can I go about this?

You could import it into MS Access, sort there and export again. Maybe you could even do all the splitting up in Access directly, but I don’t know enough about Access to help with that. And I guess using Access would make it harder to automate this whole task, in case that’s required.

That’s not very “rubyish”, of course :slight_smile:

Thomas

On Nov 30, 2006, at 2:45 PM, Paul L. wrote:

Also, there is a hidden assumption in your position – that
libraries, ipso facto, represent robust methods.

For the newbies, however, it might matter. They might think library
contents differ from ordinary code.

I sure hope they think that! I know I do.

There’s no faster way to find bugs than to bundle up some code and
turn it loose on the world. That leads to more robust code. This is
the reason open source development works so well.

If one of us patches a library, everyone benefits. It’s like having
a few hundred extra programmers on your staff.

Yes, I realize I’m over generalizing there. There will always be
poorly supported or weak libraries, but someone just forks or
replaces those eventually.

On the other hand, if your data does not exploit this CSV trait (few
real-world CSV databases embed linefeeds)…

Really? How do they handle data with newlines in it?

Linefeeds are escaped as though in a normal quoted string. This is how I have always dealt with embedded linefeeds, which is why I was ignorant of the specification’s language on this (an explanation, not an excuse).

So a linefeed is \n and then we need to escape the \ so that is \\, I assume. Interesting.

I would argue that is not CSV, but it’s certainly debatable. My reasoning is that you either need to post process the CSV parsed data to restore it or use a custom parser that understands CSV plus your escaping rules.
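
To illustrate, the post processing amounts to something like this; a sketch only, since the exact escape convention is application-specific:

# Undo MySQL-style escaping after an ordinary CSV parse: turn the
# two-character sequences \n, \t and \\ back into real characters.
def unescape_field(field)
  field.gsub(/\\(.)/) do
    case $1
    when "n" then "\n"
    when "t" then "\t"
    else $1
    end
  end
end

row = ["abc", 'line one\nline two', 'a\\\\b']
p row.map { |field| unescape_field(field) }
# => ["abc", "line one\nline two", "a\\b"]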

Which “CSV databases” are you referring to here?

MySQL, the database I am most familiar with, uses this method for import or export of comma- or tab-separated plain-text data. Within MySQL’s own database protocol, linefeeds really are linefeeds, but an imported or exported plain-text table has them escaped within fields.

Wild. I use MySQL every day. Guess I’ve never dumped a CSV of linefeed-containing data with it though. (I generally walk the database myself with a Ruby script and dump with FasterCSV.)

It just takes longer if all the database handling (not just record parsing) must use the same state machine that field parsing must use.

I don’t understand this comment. Like most databases, MySQL does not use CSV internally.

It’s very simple, really. Once you allow the record separator inside a
field, you give up any chance to parse records quickly.

Have you heard of the FasterCSV library? :wink: It’s pretty zippy.

But parsing will necessarily be slow, character by character, the entire database scan must use an intelligent parser (no splitting records on linefeeds as I have been doing), and the state machine needs a few extra states.

You don’t really have to parse CSV character by character. FasterCSV
does most of its parsing with a single highly optimized (to avoid
backtracking) regular expression and a few tricks.

Basically you can read line by line and divide into fields. If you
have an unclosed field at the end of the line, you hit an embedded
linefeed. You then just pull and append the next line and continue
eating fields.
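
That logic can be sketched in a few lines. This is not FasterCSV’s implementation, just the shape of the idea; it leans on the CSV rule that quotes inside a quoted field are doubled, so an odd number of quote characters means a field is still open:

# Yield one logical CSV row (as a raw string) at a time, gluing
# physical lines together whenever a quoted field is still open.
# Field splitting would happen after this step.
def each_csv_row(io)
  while line = io.gets
    while line.count('"') % 2 == 1 && (more = io.gets)
      line << more              # embedded linefeed: pull the next line
    end
    yield line
  end
end

require 'stringio'
data = StringIO.new(%Q{a,b,"c\nstill c",d\n1,2,3,4\n})
each_csv_row(data) { |row| p row }
# => "a,b,\"c\nstill c\",d\n"
#    "1,2,3,4\n"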

The standard CSV library does not do this and that is one of two big
reasons it is so slow.

James Edward G. II

James Edward G. II wrote:

/ …

So a linefeed is \n and then we need to escape the \ so that is \\, I assume. Interesting.

I would argue that is not CSV, but it’s certainly debatable.

Not really debatable IMHO. Embedded linefeeds are optional, and the two-character “\n” sequence consists of allowed characters. It’s an application-specific option that happens to be fairly common.

Consider what would happen if someone embedded a bunch of HTML break tags in CSV fields. Does that mean it isn’t CSV? Same idea. It’s just a common convention to symbolize a linefeed.

My reasoning is that you either need to post process the CSV parsed data to restore it or use a custom parser that understands CSV plus your escaping rules.

Or you can leave it alone and let the calling application deal with it. In a case like this, it is often two cooperating applications, both of which know the conventions in use. The CSV parser doesn’t have to address this convention directly.

/ …

Wild. I use MySQL everyday. Guess I’ve never dumped a CSV of
linefeed containing data with it though. (I generally walk the
database myself with a Ruby script and dump with FasterCSV.)

If you have a database with embedded linefeeds, and you dump it to the
console:

echo "Select * from database.table;" | mysql (authorization)

You will see the linefeeds escaped in the output (the fields are tab-separated, not comma-separated, but the result is the same).

Oh, that reminds me. Tabs are also often represented symbolically in the same general way, to avoid a conflict with tab field delimiters, when they are used.

It just takes longer if all the database handling (not just record parsing) must use the same state machine that field parsing must use.

I don’t understand this comment. Like most databases, MySQL does not use CSV internally.

Yes, I know. Here I am referring to the task of parsing CSV tables; it’s no longer about MySQL.

It’s very simple, really. Once you allow the record separator inside a
field, you give up any chance to parse records quickly.

Have you heard of the FasterCSV library? :wink: It’s pretty zippy.

Yes, and whatever speed it achieves, it is proportionally slower to the
degree that it follows the requirement to allow linefeeds inside fields.

But parsing will necessarily be slow, character by character, the entire database scan must use an intelligent parser (no splitting records on linefeeds as I have been doing), and the state machine needs a few extra states.

You don’t really have to parse CSV character by character. FasterCSV
does most of its parsing with a single highly optimized (to avoid
backtracking) regular expression and a few tricks.

Then the regex engine parses character by character, rather than the explicit Ruby code. Same proportions as explained above, but probably much faster overall (I assume Ruby’s regex engine is written in native code).

Basically you can read line by line and divide into fields.

Actually you must do both at once. You can’t simply parse record delimiters without paying attention to the field delimiters and field contents. I know you know this, I am just adding it for other readers.

If you
have an unclosed field at the end of the line, you hit an embedded
linefeed. You then just pull and append the next line and continue
eating fields.

It must be some regex scheme to simultaneously deal with embedded field delimiters as well as the various quote issues and, finally, embedded record delimiters.

The standard CSV library does not do this and that is one of two big
reasons it is so slow.

I’m glad to see there is healthy competition for quality code, even in
libraries.

Thomas M. wrote:

… I don’t know enough about Access to help with that. And I guess using Access would make it harder to automate this whole task, in case that’s required.

That’s not very “rubyish”, of course :slight_smile:

Thomas

Well … databases are highly Ruby-ish! Every major database,
including, I’m guessing, MS Abscess*, has a simple Ruby interface.
Reading a huge file into memory is the wrong approach.

  • Abscess – An abscess is a localized collection of pus in any part of
    the body, caused by an infection.

http://www.nlm.nih.gov/medlineplus/ency/article/001353.htm


M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
http://borasky-research.blogspot.com/

If God had meant for carrots to be eaten cooked, He would have given
rabbits fire.

Drew O. wrote:

James -

I’ll give FasterCSV a try when I get home from work and out from behind
this proxy. Here’s another question: in some cases I need to sort the
file before splitting it (in this case sorting by the 4th cell in each
row). However, the current file I’m trying to sort and split is around
76 MB and ruby fails when trying to store the CSV as an array. The code
and output are below. How else can I go about this?

On Linux (or CygWin, since you’re on Windows):

$ sort -k 4 -t , <input file> > <sorted file>

That will probably put the header line at the bottom of the file or perhaps the middle, but that’s pretty easy to fix:

$ cp <file containing just the header line> <sorted file>
$ grep -v <header pattern> <input file> |
sort -k 4 -t , >> <sorted file>

However:

If the file is truly CSV (numeric values unquoted, date and time stamps
in official format for your locale, and all text fields quoted) there is
actually a way you can treat it like a table in a database with ODBC.
Open your little ODBC data sources gizmo and create a “File DSN”. Use
the “text” driver. It will let you configure it so that the “database”
is the directory where your CSV file lives and your CSV file is a table
in that database. Then anything that can connect to an ODBC Data Source
Name will be able to query your CSV file.

If the sort is numeric, add a “-n” to the sort command above. If it’s
descending, add “-r” and if it’s both numeric and descending add “-nr”.
“man sort” is your friend.
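
If you’d rather drive that from Ruby while still letting sort(1) do the heavy lifting (the file is too big to hold in memory), a sketch along these lines should work; hypothetical file names, and it assumes no embedded linefeeds or commas inside quoted fields:

input  = "xyz.csv"
output = "xyz_sorted.csv"

File.open(input, "r") do |ifile|
  header = ifile.gets
  IO.popen("sort -t , -k 4", "r+") do |pipe|
    ifile.each_line { |line| pipe.write(line) }
    pipe.close_write                    # tell sort the input is finished
    File.open(output, "w") do |ofile|
      ofile.write(header)               # header goes back on top
      pipe.each_line { |line| ofile.write(line) }
    end
  end
end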


M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
http://borasky-research.blogspot.com/

If God had meant for carrots to be eaten cooked, He would have given
rabbits fire.

Thomas M. wrote:

2006/11/30, Drew O. [email protected]:

I’ll give FasterCSV a try when I get home from work and out from behind
this proxy. Here’s another question: in some cases I need to sort the
file before splitting it (in this case sorting by the 4th cell in each
row). However, the current file I’m trying to sort and split is around
76 MB and ruby fails when trying to store the CSV as an array. The code
and output are below. How else can I go about this?

I’m coming to this party really late, so I hope I don’t come across as
shamelessly plugging KirbyBase, but, you might want to try it for this.

If you are simply trying to take a large csv file, sort it by one of its
fields, and split it up into smaller files that each contain 40,000
records, I think it might work.

Here’s some code (not tested, could be incorrect) off the top of my
head:

require 'kirbybase'

db = KirbyBase.new

tbl = db.create_table(:foo, :field1, :String, :field2, :Integer,
:field3, :String…

tbl.import_csv(name_of_csv_file)

rec_count = tbl.total_recs
last_recno_written_out = 0

while rec_count > 0
recs = tbl.select { |r| r.recno > last_recno_written_out and r.recno <
last_recno_written_out + 40000 }.sort(:field4)

… here is where you put the code to write these 40,000 recs to
a csv output file …

last_recno_written_out = recs.last.recno

rec_count = rec_count - 40000
end

KirbyBase will even use FasterCSV for its csv stuff if you have it installed. :slight_smile:

Anyway, hope this helps. If I have totally misunderstood the request,
feel free to ignore!

Jamey C.

Paul L. wrote:

Edwin F. wrote:

I am perplexed by CSV.open. In IO and File, open returns something that
quacks like an IO object. You can then call gets, puts, read, write and
so on. The CSV open seems to return an array (or gives you a row at a
time).

/ …

I wonder why it was not patterned more closely on IO? Any thoughts?

This is an experience with which I am becoming familiar. Someone requests a solution to a problem. Someone else offers the option of a library to solve the problem. Then the original problem fades into the background, replaced by discussion of the library’s problems.

This same pattern has repeated itself about four times in the past
fortnight, in just this one newsgroup.
/snip/

Ummm, I am not sure exactly how to interpret the above post, but I see my name quoted there, so I feel compelled to clarify what I was thinking in making my original post. I had just written a small Ruby program that would satisfy the OP’s stated problem, but using IO/File. While I was doing this, more posts appeared, which alerted me to the possibility that I would have to cater for newlines in the input. “Oh well”, I thought, “I’ll just replace every use of IO with CSV, and that will be that.” BZZZ! Wrong! Thank you for playing. I couldn’t drop in CSV instead of IO? WTF???

This is where my perplexity came in. Matz himself has regularly and clearly stated that he designed Ruby along the Principle of Least Surprise (or LOLA, Law of Least Astonishment). Well, I was grievously surprised and astonished when CSV#open behaved differently from every open I have used in any language. All the other opens that I know of return the concept of a handle/object, or something that can then be beseeched to bring forth the contents of the actual I/O “device”, one element at a time, or all at once. CSV#open skips this step and goes straight from open to bringing forth entities, and thereby breaks possible compatibility with IO/File. IMHO, this is inelegant design.

I have written many, many libraries (not in Ruby) and know how important it is to present to your users a consistent, clean, orthogonal, robust, reliable set of interfaces and implementations. That’s why it is inadvisable to release a very early version of a library to a large audience of end users until it has proven itself in battle, as it were. Otherwise, you face the prospect of having to change the interface to be less surprising (and keep a backward-compatible, deprecated one) and re-releasing it to possibly hundreds of users.

The bottom line is, although I am perfectly capable of doing so, I don’t
WANT to reinvent the wheel. I really like reusing good, dependable,
predictable code. I haven’t had time to look at FasterCSV yet, but I
predict that I will like what I see, because to my mind, from the works
of his I have seen, the author does his best to embody the “Tao” of Ruby
(LOLA). (Although you can never accurately describe the Tao of anything,
I know…)

Well, that’s my 2c worth :slight_smile:

Jamey C. wrote:

shamelessly plugging KirbyBase, but, you might want to try it for this.
db = KirbyBase.new
recs = tbl.select { |r| r.recno > last_recno_written_out and r.recno
< last_recno_written_out + 40000 }.sort(:field4)

… here is where you put the code to write these 40,000 recs to
a csv output file …

last_recno_written_out = recs.last.recno

rec_count = rec_count - 40000
end

I realized this morning that the solution I posted last night won’t work
because you need the whole dataset sorted before you start splitting
it up into 40,000 record files. Oops!

Anyway, in an attempt to recover gracefully from my mistake and also to
give me the opportunity to shamelessly plug another one of my libraries,
I present the following proposed solution that is totally untested and
probably full of holes:

require 'mongoose'
db = Mongoose::Database.new

db.create_table(:foo) do |tbl|
tbl.add_column(:field1, :string)
tbl.add_column(:field2, :string)
tbl.add_column(:field3, :integer)
tbl.add_indexed_column(:field4, :string)
.
.
.
end

Foo.import(csv_filename)

total_recs_written = 0

while total_recs_written < Foo.last_id_used
recs = Foo.find(:order => :field4, :offset => total_recs_written,
:limit => 40000)

    … here is where you put the code to write these 40,000 recs
    to a csv output file …

total_recs_written += recs.size

end
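
For completeness, the elided writing step might look roughly like this with FasterCSV. Untested; it assumes each record returned by Foo.find exposes field1 through field4 as attribute readers (my assumption about the model, not documented Mongoose behaviour), and the require lines may differ depending on how FasterCSV is installed:

require 'rubygems'
require 'faster_csv'

# Write one chunk of records to its own csv file.  Call it from inside
# the loop above, e.g.  write_chunk(recs, total_recs_written / 40000 + 1)
def write_chunk(recs, file_number)
  FasterCSV.open("xyz_part_#{file_number}.csv", "w") do |csv|
    recs.each do |rec|
      csv << [rec.field1, rec.field2, rec.field3, rec.field4]
    end
  end
end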

Jamey


Edwin F. wrote:

/ …

The bottom line is, although I am perfectly capable of doing so, I don’t
WANT to reinvent the wheel. I really like reusing good, dependable,
predictable code. I haven’t had time to look at FasterCSV yet, but I
predict that I will like what I see, because to my mind, from the works
of his I have seen, the author does his best to embody the “Tao” of Ruby
(LOLA). (Although you can never accurately describe the Tao of anything,
I know…)

Nice, informative post. There are a lot of issues here, primarily the fact that the database under discussion is too big to hold in memory, and it is also too big to fit into Excel in one chunk, which appears to be its destination.

Most people have begun to drift toward suggesting a database approach, rather than anything that involves direct manipulation of the database in Ruby. Because of the size of the database and because sorting the records is one goal, I have to agree.