Splitting a CSV file into 40,000 line chunks

dfg59 · November 29, 2006, 4:32pm

All -

I’ve written a script to split a .csv file into smaller .csv files of
40,000 lines each. The intent here is to break the file down enough so
that excel does not have issues reading each chunk. My code takes a
filename from the command line and breaks it down as so:

infile -> xyz.csv

output -> xyz_part_1.csv
xyz_part_2.csv
etc…

My code is working but I don’t find it very “rubyish”. In particular, I
hate having my index and counter counters and I don’t like that I had to
declare my header variable outside of the loop. Bear in mind here that I
can not do something like “rows = CSV.open(infile)” because ruby will
yell and error as the input file is too big (250 mb). Any advice on
making the code nicer is appreciated. The current code is as follows:

require ‘csv’

infile = ARGV[0] if ARGV[0] != nil

counter = 1
index = 0
header = “”
writer = CSV.open(infile.gsub(/./,“part”+counter.to_s+"."),‘w’)

CSV.open(infile, ‘r’) do |row|
if(index != 0 && index%40000 == 0)
writer.close
counter+=1
writer = CSV.open(infile.gsub(/./,“part”+counter.to_s+"."),‘w’)
writer << header
end
if (index == 0)
header = row
end
writer << row
index += 1
end

writer.close()

dfg59 · November 29, 2006, 5:35pm

On 11/29/06, Drew O. [email protected] wrote:

My code is working but I don’t find it very “rubyish”. In particular, I
hate having my index and counter counters and I don’t like that I had to
declare my header variable outside of the loop. Bear in mind here that I
can not do something like “rows = CSV.open(infile)” because ruby will
yell and error as the input file is too big (250 mb). Any advice on
making the code nicer is appreciated. The current code is as follows:

File.open(“test”, “w+”) do |f|
20.times do |i|
f.write “abc|def|ghi#{i}\n”
end
end

File.open(“test”, “r”) do |fin|
part = 0
while row = fin.gets

File.open("test_part#{part}", "w+") do |f|
  5.times do
    f.write(row)
    break unless row = fin.gets
  end
end

part += 1

end
end

dfg59 · November 29, 2006, 5:51pm

Drew O. wrote:

      xyz_part_2.csv
      etc...
My code is working but I don’t find it very “rubyish”. In particular, I
hate having my index and counter counters

Consider that the problem is one of counting input lines. In a case like
this, it is not possible to avoid using a counter. It’s in the nature of
the problem to be solved.

counter = 1
index = 0
header = “”
writer = CSV.open(infile.gsub(/./,“part”+counter.to_s+"."),‘w’)

CSV.open(infile, ‘r’) do |row|

Why are you using CSV for this? You aren’t parsing the lines into
fields, so
the fact that they contain CSV content has no bearing on the present
task.
Your goal is to split the input file into groups of lines delimited by
linefeeds, not fields delimited by commas.

Why not simply read lines from the input file and write them to a series
of
output files, until the input file is exhausted?

#!/usr/bin/ruby -w

max_output_lines = 1000

input_file = “test.txt”

output_base = “output”

n = 0

ifile = File.open(input_file,“r”)

header = ifile.gets

until(ifile.eof?)
ofn = output_base + sprintf("%03d",n) + “.txt”
ofile = File.open(ofn,“w”)
ofile.write(header)
line = 2
until(ifile.eof? || line > max_output_lines)
ofile.write(ifile.gets)
line += 1
end
ofile.close
n += 1
end

ifile.close

Just change the number for “max_output_lines” to suit your requirement.

dfg59 · November 29, 2006, 5:53pm

On 11/29/06, Drew O. [email protected] wrote:

I’ve written a script to split a .csv file into smaller .csv files of
40,000 lines each. The intent here is to break the file down enough so
that excel does not have issues reading each chunk. My code takes a
filename from the command line and breaks it down as so:

Not ruby, but why not use “split -l 40000”?

Cheers,
/Nick

dfg59 · November 29, 2006, 5:59pm

Paul L. wrote:

CSV.open(infile, ‘r’) do |row|

Why are you using CSV for this? You aren’t parsing the lines into fields, so
the fact that they contain CSV content has no bearing on the present task.
Your goal is to split the input file into groups of lines delimited by
linefeeds, not fields delimited by commas.

Why not simply read lines from the input file and write them to a series of
output files, until the input file is exhausted?

Because CSV understands csv data with embedded newlines:

[will@localhost ~]$ cat csvfile.csv
some, field,“new
line”
other,field,here
[will@localhost ~]$ cat test.rb
require ‘csv’

CSV.open(‘csvfile.csv’, ‘r’) do |row|
p row
end
[will@localhost ~]$ ruby test.rb
[“some”, " field", “new\nline”]
[“other”, “field”, “here”]

will.

dfg59 · November 29, 2006, 6:02pm

Nice Simon

But the ‘while row = fin.gets’ is skipping a row each time around and
you don’t handle the header. My “fix”:
Simon S. wrote:…

File.open(“test”, “w+”) do |f|
20.times do |i|
f.write “abc|def|ghi#{i}\n”
end
end

File.open(“test”, “r”) do |fin|
part = 0
header = fin.gets
row = fin.gets
while row
File.open("test_part#{part}", "w+") do |f|

     f.write(header)

  5.times do
    f.write(row)
    break unless row = fin.gets
  end
end

part += 1

end
end
…

Cheers
Chris

dfg59 · November 29, 2006, 6:10pm

On 11/29/06, ChrisH [email protected] wrote:

Nice Simon

But the ‘while row = fin.gets’ is skipping a row each time around and
you don’t handle the header. My “fix”:

I usually mess up in my loops like this
Thanks

dfg59 · November 29, 2006, 5:59pm

On Nov 29, 2006, at 10:53 AM, Nick S. wrote:

On 11/29/06, Drew O. [email protected] wrote:

I’ve written a script to split a .csv file into smaller .csv files of
40,000 lines each. The intent here is to break the file down
enough so
that excel does not have issues reading each chunk. My code takes a
filename from the command line and breaks it down as so:

Not ruby, but why not use “split -l 40000”?

Because the sample code is also moving the headers to the new files.

James Edward G. II

dfg59 · November 29, 2006, 6:38pm

Will J. wrote:

Because CSV understands csv data with embedded newlines:

A plain-text CSV file uses linefeeds as record delimiters. A program
that
uses “readline” or “gets” splits the records just as a sane CSV parser
would. And IMHO a CSV file should never, ever have linefeeds embedded in
fields.

[will@localhost ~]$ cat csvfile.csv
some, field,“new
line”
other,field,here

Ah, yes, another terrific “improvement” in database protocols. Sometimes
I
wonder what these standards committees are smoking.

dfg59 · November 29, 2006, 6:28pm

James Edward G. II wrote:

On Nov 29, 2006, at 10:53 AM, Nick S. wrote:

Not ruby, but why not use “split -l 40000”?

Because the sample code is also moving the headers to the new files.

James Edward G. II

Also since he’s feeding Excel this is most likely on Windows rather
than *NIX

Cheers
Chris

dfg59 · November 29, 2006, 7:10pm

Thanks for all the responses. As noted in a post above, I am trying to
place the headers at the beginning of each split file. Should have
mentioned that earlier. Also, regarding the above responses, I am
working on a Windows system, hence not being able to use nice shell
scripts.

Anyway, I seems that my logic was on the right track, the use of ugly
counters just made my inner-rubyist upset. Thanks for all the feedback,
I will take a look and reading/writing the files using the File object
as opposed to the CSV object. In fact, I didn’t even think about that.
In previous scripts I had been accessing the CSV fields individually
but, as it is unnecessary here, I will toy around with performance gains
using File object.

Speaking of performance, using the code I’ve described seemed to be
quite slow. This is understandable based on the fact that the input file
has 1 million+ records, however I was wondering if I would see
performance gain by using File objects as opposed to CSV objects. Also,
any other tips on increasing the performance of this code would be
helpful.

Again, thanks for all the responses. I am continually impressed by the
attitude and intelligence of this community.

dfg59 · November 29, 2006, 7:38pm

On Nov 29, 2006, at 12:10 PM, Drew O. wrote:

Speaking of performance, using the code I’ve described seemed to be
quite slow. This is understandable based on the fact that the input
file
has 1 million+ records, however I was wondering if I would see
performance gain by using File objects as opposed to CSV objects.
Also,
any other tips on increasing the performance of this code would be
helpful.

CSV is known to be pretty slow, thus FasterCSV.

James Edward G. II

dfg59 · November 29, 2006, 6:48pm

On Nov 29, 2006, at 9:32 AM, Drew O. wrote:

Any advice on making the code nicer is appreciated.

I’ll let you decide if this is any better:

Firefly:~/Desktop$ ls
gen_csv.rb split_csv.rb
Firefly:~/Desktop$ ruby gen_csv.rb
Firefly:~/Desktop$ ls
data.csv gen_csv.rb split_csv.rb
Firefly:~/Desktop$ cat data.csv
rec_no,whatever
0,“blah, blah, blah, …”
1,“blah, blah, blah, …”
2,“blah, blah, blah, …”
3,“blah, blah, blah, …”
4,“blah, blah, blah, …”
5,“blah, blah, blah, …”
6,“blah, blah, blah, …”
7,“blah, blah, blah, …”
8,“blah, blah, blah, …”
9,“blah, blah, blah, …”
10,“blah, blah, blah, …”
11,“blah, blah, blah, …”
12,“blah, blah, blah, …”
13,“blah, blah, blah, …”
14,“blah, blah, blah, …”
15,“blah, blah, blah, …”
16,“blah, blah, blah, …”
17,“blah, blah, blah, …”
18,“blah, blah, blah, …”
19,“blah, blah, blah, …”
Firefly:~/Desktop$ ruby split_csv.rb data.csv
Firefly:~/Desktop$ ls
data.csv data_part_2.csv data_part_4.csv gen_csv.rb
data_part_1.csv data_part_3.csv data_part_5.csv split_csv.rb
Firefly:~/Desktop$ cat data_part_1.csv
rec_no,whatever
0,“blah, blah, blah, …”
1,“blah, blah, blah, …”
2,“blah, blah, blah, …”
3,“blah, blah, blah, …”
Firefly:~/Desktop$ cat data_part_3.csv
rec_no,whatever
8,“blah, blah, blah, …”
9,“blah, blah, blah, …”
10,“blah, blah, blah, …”
11,“blah, blah, blah, …”
Firefly:~/Desktop$ cat split_csv.rb
#!/usr/bin/env ruby -w

require “rubygems”
require “faster_csv”

original_file = ARGV.shift
counter = 1

FCSV.open(original_file.sub(".", “part#{counter}.”), “w”) do |out|
FCSV.open(original_file, :headers => true).each_with_index do |
row, index|
if (index % 4).zero?
if index.nonzero?
counter += 1
out.reopen(original_file.sub(".", “part#{counter}.”), “w”)
end
out << row.headers
end
out << row
end
end

END

James Edward G. II

dfg59 · November 29, 2006, 7:48pm

On Nov 29, 2006, at 11:29 AM, Paul L. wrote:

delimited
uses “readline” or “gets” splits the records just as a sane CSV parser
would. And IMHO a CSV file should never, ever have linefeeds
embedded in
fields.

Your opinion doesn’t make you right on this one. The CSV RFC clearly
defines handling for carriage-returns and linefeeds. They certainly
are allowed in fields. Here is a link to the document, in case you
want to read up:

http://www.ietf.org/rfc/rfc4180.txt

Not to use a CSV parser on this task would be shooting yourself in
the foot. The result using a simple File object would be broken and,
much worse, it might look OK for a while. You just can’t be sure you
are never going to split a CSV file that has an embedded linefeed in
it (especially since that’s perfectly legal), and when you do you
will be responsible for destroying data. There’s just no reason for
that.

I know you’re a don’t-use-a-library guy and you know I disagree.
This is the reason why. The edge cases will get you every time.

James Edward G. II

dfg59 · November 29, 2006, 7:56pm

James -

I’ll give FasterCSV a try when I get home from work and out from behind
this proxy. Here’s another question: in some cases I need to sort the
file before splitting it (in this case sorting by the 4th cell in each
row). However, the current file I’m trying to sort and split is around
76 MB and ruby fails when trying to store the CSV as an array. The code
and output are below. How else can I go about this?

Code:

require ‘csv’

infile = “”

if ARGV[0] != nil
infile = ARGV[0]
else
puts “Please enter a file to split”
exit
end

puts “loading file”

rows = CSV.read(infile)

puts “file loaded”

rows.sort!{|x,y| x[3] <=> y[3]}

counter = 1
header = “”
writer = CSV.open(infile.gsub(/./,“part”+counter.to_s+"."),‘w’)

rows.each_index do |i|
if(i != 0 && i%40000 == 0)
writer.close
counter+=1
writer = CSV.open(infile.gsub(/./,“part”+counter.to_s+"."),‘w’)
writer << header
end
if (i == 0)
header = row
end
writer << row
end

writer.close()

Output:

loading file
The exception unknown software exception (0xc00000fd) occured in the
application at location…etc

dfg59 · November 29, 2006, 8:22pm

On Thu, 2006-11-30 at 02:25 +0900, ChrisH wrote:

Also since he’s feeding Excel this is most likely on Windows rather
than *NIX
not a show stopper…
http://www.research.att.com/sw/tools/uwin/

http://unxutils.sourceforge.net/

dfg59 · November 29, 2006, 8:04pm

On Nov 29, 2006, at 12:56 PM, Drew O. wrote:

Here’s another question: in some cases I need to sort the
file before splitting it (in this case sorting by the 4th cell in each
row). However, the current file I’m trying to sort and split is around
76 MB and ruby fails when trying to store the CSV as an array. The
code
and output are below. How else can I go about this?

Hmm, that’s a good question.

Perhaps you can collect just the key values in an Array and then use
those to reorder the lines bit by bit. That’s not going to be fast
with any library helping you, but I don’t have a better idea.

James Edward G. II

dfg59 · November 29, 2006, 8:44pm

Reid T. wrote:

On Thu, 2006-11-30 at 02:25 +0900, ChrisH wrote:

Also since he’s feeding Excel this is most likely on Windows rather
than *NIX
not a show stopper…
http://www.research.att.com/sw/tools/uwin/

http://unxutils.sourceforge.net/

Reid -

I also have cygwin installed on my box, so if there is a simply way to
do this in unix I’d love to know. My unix is rusty (that’s being
generous).

dfg59 · November 29, 2006, 9:57pm

On Thu, 2006-11-30 at 04:04 +0900, James Edward G. II wrote:

Perhaps you can collect just the key values in an Array and then use
those to reorder the lines bit by bit. That’s not going to be fast
with any library helping you, but I don’t have a better idea.

James Edward G. II
sort -n +4 out
or
sort +4 out

rthompso@jhereg:~$ head -50 out
this is a test 10202
this is a test 23990
this is a test 11056
this is a test 9606
this is a test 28590
this is a test 18264
this is a test 12902
this is a test 12856
this is a test 27571
this is a test 5495
this is a test 15965
this is a test 22229
this is a test 18865
this is a test 31339
this is a test 21913
this is a test 406
this is a test 8602
this is a test 5329
this is a test 10048
this is a test 6458
this is a test 20069
this is a test 19771
this is a test 21844
this is a test 24719
this is a test 30894
this is a test 25239
this is a test 9900
this is a test 1727
this is a test 12042
this is a test 20832
this is a test 23735
this is a test 28768
this is a test 10283
this is a test 390
this is a test 10480
this is a test 1337
this is a test 2745
this is a test 26398
this is a test 32288
this is a test 3797
this is a test 22251
this is a test 458
this is a test 14679
this is a test 29642
this is a test 19943
this is a test 26342
this is a test 24232
rthompso@jhereg:~$ sort +4 out
this is a test 0
this is a test 10
this is a test 10
this is a test 100
this is a test 100
this is a test 100
this is a test 1000
this is a test 1000
this is a test 10001
this is a test 10001
this is a test 10005
this is a test 10006
this is a test 10006
this is a test 10007
this is a test 10007
this is a test 10008
this is a test 10008
this is a test 10009
this is a test 10009
this is a test 1001
this is a test 10010
this is a test 10011
this is a test 10012
this is a test 10012
this is a test 10012
this is a test 10012
this is a test 10012
this is a test 10013
this is a test 10014
this is a test 10014
this is a test 10015
this is a test 10017
this is a test 10017
this is a test 10017
this is a test 10018
this is a test 10019
this is a test 1002
this is a test 1002
this is a test 10021
this is a test 10026
this is a test 10026
-------------------- OR
rthompso@jhereg:~$ sort -n +4 out
this is a test 0
this is a test 2
this is a test 3
this is a test 3
this is a test 4
this is a test 7
this is a test 10
this is a test 10
this is a test 12
this is a test 13
this is a test 14
this is a test 14
this is a test 15
this is a test 16
this is a test 16
this is a test 20
this is a test 21
this is a test 24
this is a test 25
this is a test 28
this is a test 29
this is a test 29
this is a test 29
this is a test 30
this is a test 30
this is a test 31
this is a test 32
this is a test 32
this is a test 34
this is a test 35
this is a test 35
this is a test 35
this is a test 36
this is a test 36
this is a test 37
this is a test 37
this is a test 38
this is a test 38
this is a test 41
this is a test 41
this is a test 41
this is a test 41
this is a test 41
this is a test 41
this is a test 42
this is a test 42
this is a test 42
this is a test 43
this is a test 45
this is a test 46
this is a test 46
this is a test 47
this is a test 47

dfg59 · November 29, 2006, 10:13pm

Drew O. wrote:

     xyz_part_2.csv
counter+=1
writer.close()

I will ignore the CSV issue, not because it isn’t important, but simply
because I’m not familar with the csv parser, and this example
sufficiently represents the concept.

For maximum elegance, I would write the code this way. It uses the
helper methods (i.e. not in the stdlib) File#write_fresh, File#to_a and
Enumerable#chunks, all of which I’ve written at one time or another.
Mentally sub in the appropriate code as desired.

File.to_a(‘xyz.csv’).chunks(40000).each_with_index do |chunk,i|
File.write_fresh(“xyz_part_#{i+1}”,chunk.join("\n"))
end

File.to_a returns an array of lines
Enumerable#chunks divides an Enumerable into groups of 40k. A 100k
array would yield 2 40k chunks and a 20k chunk.
File#write_fresh creates the file if it doesn’t exist, truncates any
existing file, and writes the 2nd argument to the file.

This version is much prettier than the corresponding version without the
helper methods, but it is also clearer. It is obvious at a glance what
it does. The same can’t be said for the version without helper methods.