Daniel T. wrote in post #1179995:
Marcel Diallo wrote in post #1179942:
Hello all,
I need to make a script to find the longest sequence of DNA bases that
appears within a sequence of the genome of a fly that I have copied onto
a text file. It’s made of up of one continuous string of letters (A, C,
G, T).
I have so far only found solutions to find the longest repeated
substring in 2 or more arrays.
If any of you could give an insight on how I could go about doing it for
one single string made up of only the aforementioned letters.
Thank you in advance for all the help!
Please note my knowledge of Ruby is very limited, write as if to a
dummy.
Please also note the txt file attached is on the larger side (approx
3Mb).
EDIT: the substring has to appear at least 3 times.
I am pleased to inform you that I could find a very long sequence, it
consists of 10283 bases and it appears exactly 3 times.
You can find that sequence at starting positions 17159, 28665 and 40336
in your file. I only used the 500_000 first bases of your file to make
shorter the time required for the computation of such a sequence. The
code for the computation will not be disclosed, sorry about that.
Anyway,
one can check easily that the described sequence of length 10283 appears
three times as required and it is located in the given starting
positions.
The results and code follows, but it is coded in Lisp. Results: The
longest sequence has length 11505, it appears 3 times at the positions
given and the three subsequences are not overlapping. The naive way I
used sorting in Ruby make the computation takes several minutes, so and
I used another language, Lisp, for exploring the solutions.
longest-common-substr for k = 11505 is (11508 53395)
longest-common-substr for k = 11506 is 2639
Hence, the greatest k such that f(k) >= k is 11505
The substrings are located at positions 27441, 15935 and 39112
Code follows in Lisp, it takes about 30 seconds. Using sorting in ruby
with
1.upto(bases.length).to_a.sort {|x,y| bases[x … -1] <=> bases[y …
-1]}
takes a lot of time (several minutes, while lisp (sbcl) takes 30
seconds).
From the documentation of the function to compute the longest sequence:
"Compute the longest common substring,
The parameter k is used to guarantee that each pair of substring are
at a distance greater than k. The condition f(k)>=k guarantees that
there is not overlapping so that the solution obtained is feasible.
Finally, since f(k) is a decreasing function of k, the
longest feasible solution is the greatest k such that f(k) >= k. Such
k can be found with a loop such as: while f(k)>=k increase k, but is
better to look for k using binary search.
Note that the ordering used for the indices implies that
mismatch(i, i+p) = k0 then mismatch(i,j)>= k0 for each j with
i<=j<=k0.
"
(defparameter bases (with-open-file (f “mosca.txt”
:direction :input
:element-type 'base-char)
(read-line f)))
(defun ordena()
(let ((anum (coerce (loop for i below (length bases) collect i)
'vector)))
(sort anum (lambda(i j) (string< bases bases :start1 i :start2
j)))))
(defparameter anum (ordena))
(defun mis(i j)
(let ((i1 (aref anum i))
(j1 (aref anum j)))
(- (mismatch bases bases :start1 i1 :start2 j1) i1)))
(defun longest-common-substr(k)
"Compute the longest common substring,
The parameter k is used to guarantee that each pair of substring are
at a distance greater than k. The condition f(k)>=k guarantees that
there is not overlapping so that the solution obtained is feasible.
Finally, since f(k) is a decreasing function of k, the
longest feasible solution is the greatest k such that f(k) >= k. Such
k can be found with a loop such as: while f(k)>=k increase k, but is
better to look for k using binary search.
Note that the ordering used for the indices implies that
mismatch(i, i+p) = k0 then mismatch(i,j)>= k0 for each j with
i<=j<=k0.
"
(loop with maxi = 0 with pos = 0 with aux = 0
for i upfrom 0 below (- (length anum) 3)
for a = (aref anum i)
for b = (aref anum (1+ i))
for c = (aref anum (+ 2 i))
when (and (> (abs (- a b)) k)
(> (abs (- a c)) k)
(> (abs (- b c)) k))
do
(progn
(setq aux (- (mismatch bases bases :start1 a :start2 c) a))
(when (> aux maxi)
(setq maxi aux
pos i)))
finally (return (values maxi pos))))
(multiple-value-bind (maxlen location)
(longest-common-substr 11505)
(format t “longest-common-substr for k = 11505 is ~S~%” (list maxlen
location))
(format t “longest-common-substr for k = 11506 is ~S~%”
(longest-common-substr 11506))
(format t “Hence, the greatest k such that f(k) >= k is 11505~%”)
(format t “The substrings are located at positions ~D, ~D and ~D~%”
(aref anum location) (aref anum (1+ location)) (aref anum (+ 2
location))))