Hi,
I need to count the number of occurrences of each word in a large text file (>5GB).
I was thinking of creating a Hash (Ruby’s equivalent of a HashMap) and, while traversing the file, updating the count for each word as I go along.
I can’t read the whole file at once because it would bog down my memory, so I thought of reading it in chunks. I could only find ways of reading a text file line by line, in this link:
However, I couldn’t find a way to read the file in chunks in case it has no line breaks (one enormous line). How can I, for example, read 1000 words per read?
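Roughly, the counting part I have in mind is just a Hash with a default value of 0 (a toy sketch on a small array, not the real file):

counts = Hash.new(0)  # unseen words default to a count of 0
%w[the cat and the hat].each { |word| counts[word] += 1 }
p counts  # => {"the"=>2, "cat"=>1, "and"=>1, "hat"=>1}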
I don’t have access to a file that big to test with. But I would say: don’t use the IO#readlines or IO#read methods, because they read the whole file and keep the content in RAM. Use IO.foreach for maximum efficiency…
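A minimal sketch of that approach, assuming the file does have newlines (‘big.txt’ is just a placeholder name):

counts = Hash.new(0)
IO.foreach('big.txt') do |line|   # streams one line at a time, so memory stays flat
  line.split.each { |word| counts[word] += 1 }
end
p counts.size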
Here’s a problem I faced:
So basically you just use the evaluator and add items to an array based on your needs.
The hard part will be determining what constitutes a word and what constitutes a word delimiter. Regular expressions should help here if you can determine what kind of ‘text’ file you are working with.
First, thanks for the response!
Let’s say that a word is anything between two spaces, disregarding punctuation (!, (, ), ?, and so forth).
I don’t think I will handle the case where a word contains apostrophes (for example “it’s”), because it complicates the parsing too much.
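Purely as an illustrative sketch of that definition: split on whitespace, then strip punctuation from each end of a token (internal apostrophes are left alone):

text = "Hello, world! It's a (very) small test."
words = text.split(/\s+/).map { |t| t.gsub(/\A[[:punct:]]+|[[:punct:]]+\z/, '') }
p words  # => ["Hello", "world", "It's", "a", "very", "small", "test"]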
I generated a 1.2 GiB text file by repeating the contents of /usr/share/dict/words 1200 times. But I think I created the wrong kind of file.
If you can give us a small sample of your file, say the first 20 lines, it would be very helpful… Otherwise you could just go with IO.foreach, which will not use a huge amount of memory…
I don’t have a sample just yet, but to generate one that will do, any article in English would be OK: remove the line breaks using this website:
and then duplicate the text enough times until a large file size is reached.
Something like this (just way larger): https://ufile.io/sn9zrlu6
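For what it’s worth, a throwaway script along those lines could look like this (the seed path, output name, and repeat count are all arbitrary):

# Build a large single-line test file by repeating a seed text,
# replacing its newlines with spaces.
seed = File.read('/usr/share/dict/words').tr("\n", ' ')
File.open('big.txt', 'w') do |out|
  1200.times { out.write(seed) }
end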
I see what you are getting at now. You are trying to account for a huge text file that may be one large line (no newlines).
Well, Ruby allows the programmer to choose what the input record separator character is… I’d choose a space and read word after word. This sounds inefficient, but it isn’t as bad as it sounds, because most operating systems read in a page (4–8 KB) of data whether you request a single character, a word, or a line.
$/
The input record separator (newline by default). gets, readline, etc., take the input record separator as an optional argument.
Here’s an example using the input record separator ($/).
#! /usr/bin/env ruby
if __FILE__ == $0
  str = "This is the first"

  # Default separator ($/ is "\n"): the whole string is one "line".
  p str.lines.map { |l| l.chomp }   # => ["This is the first"]

  # Set the input record separator to a space: each word becomes a "line".
  orig = $/
  $/ = ' '
  p str.lines.map { |l| l.chomp }   # => ["This", "is", "the", "first"]

  # Restore the original separator.
  $/ = orig
  p str.lines.map { |l| l.chomp }   # => ["This is the first"]
end
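To tie that back to the original problem: IO.foreach (and File.foreach) also accepts the record separator as a second argument, which avoids mutating the global $/ at all. A sketch, untested against a real 5 GB file, with ‘big.txt’ as a placeholder:

counts = Hash.new(0)
File.foreach('big.txt', ' ') do |token|   # each "line" is one space-delimited token
  # Strip surrounding whitespace (including the separator), drop punctuation.
  word = token.strip.gsub(/[[:punct:]]/, '').downcase
  counts[word] += 1 unless word.empty?
end

# Show the ten most frequent words.
counts.sort_by { |_word, count| -count }.first(10).each do |word, count|
  puts "#{word}: #{count}"
end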