James Edward Gray II wrote:
Now you’re guilty of a new sin: encouraging people to reinvent the
wheel. You just can’t win, can you?
If the OP has a problem not easily solved with a library, then he isn’t
reinventing the wheel. And I don’t care about winning.
Different problems require different levels of paranoia.
Yes, absolutely. The larger the job and the larger the data set, the more likely one is to encounter boundary conditions, and the more appropriate it becomes to use a state machine that understands the full specification. All at the cost of speed.
Sometimes a little code will get you over the hump, but you may be making some trade-offs when you don’t use a robust library.

Yes, but a robust library is not appropriate if it cannot solve the problem, or if the learning curve is so steep that it would be easier to write one’s own scanner.
Also, there is a hidden assumption in your position: that libraries, ipso facto, represent robust methods.
Sometimes those are even good trade-offs, like sacrificing edge case handling to gain some speed. Sometimes it’s even part of the goal to avoid the library, like when I built FasterCSV to address some needs CSV wasn’t meeting.
That borders on the heretical.
As soon as things start getting serious though, I usually feel safer reaching for the library.
I’ve noticed that. I want to emphasize once again that my style is a personal preference, not an appeal to authority or untestable precepts.

The people reading this list have now seen us debate the issue and should be able to make well-informed decisions about what they think is best.
I think 90% of the readers of this newsgroup won’t pay any attention to either of our opinions on this topic. They will realize that inside every library is code written by a mortal human, and that this sort of debate is therefore primarily tilting at windmills or describing angel occupancy requirements for heads of pins.

For the newbies, however, it might matter. They might think library contents differ from ordinary code. And that is true only if the writers of libraries differ from ordinary coders. Ultimately, they don’t, as Microsoft keeps finding out.
On the other hand, if your data does not exploit this CSV trait (few real-world CSV databases embed linefeeds)…
Really? How do they handle data with newlines in it?
Linefeeds are escaped as though in a normal quoted string. This is how I have always dealt with embedded linefeeds, which is why I was ignorant of the specification’s language on this (an explanation, not an excuse).
Which “CSV databases” are you referring to here?
MySQL, the database I am most familiar with, uses this method for import or export of comma- or tab-separated plain-text data. Within MySQL’s own database protocol, linefeeds really are linefeeds, but an imported or exported plain-text table has them escaped within fields.
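
To make that convention concrete, here is a minimal Ruby sketch of this kind of backslash escaping, modeled loosely on MySQL’s plain-text import/export format (my own illustration, not MySQL’s actual code; the real format also escapes a few other characters):

  # Escape table: each special character becomes a two-character sequence.
  ESCAPES   = { "\\" => "\\\\", "\n" => "\\n", "\t" => "\\t" }
  UNESCAPES = { "n" => "\n", "t" => "\t" }

  def escape_field(field)
    # Replace each backslash, linefeed, or tab with its escape sequence.
    field.gsub(/[\\\n\t]/) { |c| ESCAPES[c] }
  end

  def unescape_field(field)
    # Reverse the mapping: "\n" back to a linefeed, "\\" to a backslash;
    # unknown escapes pass through unchanged.
    field.gsub(/\\(.)/m) { UNESCAPES.fetch($1, $1) }
  end

  row  = ["line one\nline two", "plain"]
  line = row.map { |f| escape_field(f) }.join("\t")
  # line now contains no raw linefeed or tab, so whole records can be
  # read line by line and fields split on the delimiter -- no state
  # machine needed.
  restored = line.split("\t").map { |f| unescape_field(f) }
  # restored == row

Under the specification’s quoted-linefeed convention, by contrast, the raw linefeed stays inside the field, and the reader must track quote state instead.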
I create a lot of plain-text databases, and I am constantly presenting them to MySQL for parsing (or getting plain-text back from MySQL), and this only reinforced my mistaken impression that linefeeds are always escaped in fields of this class of database.
It’s obvious why the specification reads as it does, and I should have known about this long ago. It reads as it does because it just isn’t that difficult to parse a quoted field, and it is no big deal to allow absolutely anything in the field. It just takes longer if all the database handling (not just record parsing) must use the same state machine that field parsing requires.
It’s very simple, really. Once you allow the record separator inside a field, you give up any chance to parse records quickly.
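
A small illustration of why (my own example, not from the original exchange): once a quoted field may legally contain a linefeed, splitting the document on linefeeds miscounts the records.

  # Two records; the first record's second field legally contains a linefeed.
  data = %Q{1,"line one\nline two"\n2,"plain"\n}

  # The fast, naive approach: treat every linefeed as a record boundary.
  naive_records = data.split("\n")
  naive_records.size  # => 3 -- wrong: the embedded linefeed split a record

  # A correct reader must track quote state, character by character,
  # before it can decide whether any given linefeed ends a record.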
When a group of people sit down to create a specification, the highest priority is … utility, common sense? … no, it’s immunity to criticism. The easiest way to avoid criticism is to allow absolutely anything, even if this hurts performance in real-world embodiments that obey the specification.
Someone might say, “Okay, but can you drop an entire, internally consistent CSV database into the boundaries of a single field of another CSV database, without any harm or lost data?” Using the present specification, the committee can say “yes, absolutely.”
But parsing will necessarily be slow, character by character; the entire database scan must use an intelligent parser (no splitting records on linefeeds, as I have been doing), and the state machine needs a few extra states.
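
For concreteness, here is a minimal sketch of such a character-by-character parser (my own illustration; it handles quoting, doubled quotes, and embedded linefeeds, but not every corner of the specification):

  def parse_csv(text)
    records   = []
    record    = []
    field     = ""
    in_quotes = false
    chars = text.chars
    i = 0
    while i < chars.size
      c = chars[i]
      if in_quotes
        if c == '"'
          if chars[i + 1] == '"'   # doubled quote: a literal " in the field
            field << '"'
            i += 1
          else                     # closing quote: leave the quoted state
            in_quotes = false
          end
        else
          field << c               # anything goes here, even a linefeed
        end
      else
        case c
        when '"'
          in_quotes = true         # opening quote: enter the quoted state
        when ','
          record << field          # field separator, only outside quotes
          field = ""
        when "\n"
          record << field          # record separator, only outside quotes
          records << record
          record = []
          field  = ""
        else
          field << c
        end
      end
      i += 1
    end
    records << (record << field) unless record.empty? && field.empty?
    records
  end

  parse_csv(%Q{1,"line one\nline two"\n2,"plain"\n})
  # => [["1", "line one\nline two"], ["2", "plain"]]

Because every character passes through the state machine, the scan can never simply split the input on linefeeds first; that is exactly the speed cost the quoted-linefeed rule imposes.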
I cannot tell you how many times I have foolishly said, “surely the specification doesn’t allow that!”, and I cannot remember actually ever being right after taking such a position. When I make assumptions about committees, I am always wrong.