File position and buffers

playballa23 · April 29, 2011, 11:10pm

Hi Jake,

I would still need the header intact, and it should sit on its own line,
separate from the rest of the entry:

>gi|329295464|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895), mRNA
AGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG

instead of:

“>gi|329295464|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895),
mRNAAGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG>”

I still need the header line so I can extract information from it and
from the lines after it in each entry. Your delete() removes all the
newlines, including the one after the header, so it doesn't work in this
scenario. Thanks for the input, appreciate it.

-Cee

playballa23 · April 30, 2011, 12:45am

Cee J. wrote in post #995830:

7stud – wrote in post #995821:

I suggest that people never use irb because it has too many quirks.

The first thing you need to realize is that ‘>’ is
not the separator you want to look for. That is the second bit of
erroneous advice your mentor gave you. That’s because you don’t care
what character marks the beginning of every entry, rather you care what
character marks the end of every entry. The end of every entry in your
file is marked by the string “\n\n”, so you should use that as your
input line terminator. Remember, ruby uses "\n" for the input line
separator by default, which means that when you read a file using
IO#each, ruby reads lines, where the end of a line is marked by a
newline.
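
To make that advice concrete, here is a minimal sketch that assumes the entries really are separated by a blank line. The sample data and the names `sample` and `entries` are made up for illustration; with a real file you would call each("\n\n") on the File object rather than on a StringIO.

```ruby
require 'stringio'

# Hypothetical sample data: two entries separated by a blank line.
sample = ">header one\nAAAA\nCCCC\n\n>header two\nGGGG\n"

entries = []
StringIO.new(sample).each("\n\n") do |entry|
  # Each entry arrives with its trailing separator still attached;
  # strip removes it.
  entries << entry.strip
end

entries.each do |entry|
  # The first line is the header, the rest are sequence lines.
  header, *sequence_lines = entry.split("\n")
  puts header
  puts sequence_lines.join
  puts '-' * 20
end
```

The header stays on its own line while the sequence lines are joined, which is the layout asked for earlier in the thread.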

I understand the logic, it makes sense. What if the file looked like
this, where there is one newline separating the entries?

What if you had presented that possibility from the very beginning?

require 'stringio'

str = <<ENDOFSTRING
>gi|329295464|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895), mRNA
AGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG
>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA
GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATG
CGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG

>gi|3542945647|ref|NM_7453343.5Acc3| Def3 zgc:65895 (zgc:65895), mRNA
CGTGCGGGGABCCGTACGTGCCGTGGGGGTTT
AATAGCGCGCCATCTGAGCAG
TTAGTCGCTGACGCATGCACG
ENDOFSTRING

input = StringIO.new(str)
buffer = ''

input.each do |line|
  if line[0, 1] == '>'   # a header line starts a new entry
    if buffer != ''      # buffer is empty only before the first entry
      puts buffer        # or do something else to buffer
      puts '-' * 20
    end

    buffer = ''
    buffer << line       # keep the header's newline
  else
    # sequence line (or blank line): append it without its trailing newline(s)
    buffer << line.sub(/ \n+ \z /xms, '')
  end
end

puts buffer  # for last entry,
             # or do something else to buffer

--output:--

>gi|329295464|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895), mRNA
AGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG
--------------------
>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA
GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATGCGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG
--------------------
>gi|3542945647|ref|NM_7453343.5Acc3| Def3 zgc:65895 (zgc:65895), mRNA
CGTGCGGGGABCCGTACGTGCCGTGGGGGTTTAATAGCGCGCCATCTGAGCAGTTAGTCGCTGACGCATGCACG
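
If you would rather have the entries as data than print them as you go, the same header-driven idea can be sketched as a method that returns [header, sequence] pairs. This is only one possible variation: `parse_entries` and the shortened sample data are hypothetical names and values, not something from your file or from a library.

```ruby
require 'stringio'

# Hypothetical helper: returns an array of [header, sequence] pairs.
# It keys on the leading '>' only, so it does not matter whether entries
# are separated by one newline or by a blank line.
def parse_entries(io)
  entries = []
  io.each_line do |line|
    line = line.chomp
    next if line.empty?            # skip blank lines between entries
    if line.start_with?('>')
      entries << [line, +'']       # a header starts a new entry
    elsif entries.any?
      entries.last[1] << line      # join sequence lines onto the current entry
    end
  end
  entries
end

# Made-up sample with both kinds of separation.
data = ">gi|1| Def1\nAGCT\nTTAA\n>gi|2| Def2\nCCGG\n\n>gi|3| Def3\nGGCC\n"

parse_entries(StringIO.new(data)).each do |header, sequence|
  puts header
  puts sequence
  puts '-' * 20
end
```

From there, something like header.sub(/\A>/, '').split('|') would give you the individual header fields to work with.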

If the entries in a file will all be separated by "\n\n" or they will all
be separated by "\n", then you could also ask for some user input:

print "What's the entry separator: "
sep = gets.chomp

Then:

input.each(sep) do |section|
…
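
One catch with reading the separator from the keyboard: if the user types \n\n at the prompt, gets.chomp hands back four literal characters (two backslashes and two n's), not two newlines. A rough sketch of one way around that, assuming the user types the escape sequence as text, is below. `unescape_separator` is a hypothetical helper and the sample input is made up.

```ruby
require 'stringio'

# Hypothetical helper: turn each typed backslash-n pair into a real newline.
def unescape_separator(text)
  text.gsub('\n', "\n")   # '\n' in single quotes is a backslash followed by n
end

typed = '\n\n'            # what gets.chomp returns if the user types \n\n
sep = unescape_separator(typed)

# Made-up input with entries separated by a blank line.
input = StringIO.new(">h1\nAAAA\n\n>h2\nCCCC\n")

sections = []
input.each(sep) do |section|
  sections << section.strip   # drop the trailing separator
end

p sections
```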